Abstract 

We propose a unified framework for studying both latent and stochastic block models,
which are used to cluster simultaneously rows and columns of a data matrix. In
this new framework, we study the behaviour of the groups posterior distribution, given
the data. We characterize whether it is possible to asymptotically recover the actual
groups on the rows and columns of the matrix. In other words, we establish sufficient
conditions for the groups posterior distribution to converge (as the size of the data
increases) to a Dirac mass located at the actual (random) groups configuration. In
particular, we highlight some cases where the model assumes symmetries in the matrix
of connection probabilities that prevent a correct recovery of the groups. We
also discuss the validity of these results when the proportion of non-null entries in the
data matrix converges to zero.

Keywords and phrases: Block clustering, block modelling, latent block model, posterior distribu- 
tion, stochastic block model. 



1 Introduction 

Cluster analysis is an important tool in a variety of scientific areas including pattern recog- 
nition, microarrays analysis, document classification and more generally data mining. In 
these contexts, one is interested in data recorded in a table or matrix, where for instance 
rows index objects and columns index features or variables. While the majority of clustering 
procedures aim at clustering either the objects or the variables, we focus here on procedures 
which consider the two sets simultaneously and organize the data into homogeneous blocks. 
More precisely, we are interested in probabilistic models called latent block models (LBMs),
where both rows and columns are partitioned into latent groups (Govaert and Nadif, 2003).
Stochastic block models (SBMs, Holland et al., 1983) may be viewed as a particular case
of LBMs where data consists in a random graph which is encoded in its adjacency matrix.
An adjacency matrix is a square matrix where rows and columns are indexed by the same
set of objects and an entry in the matrix describes the relation between two objects. For
instance, binary random graphs are described by a binary matrix where an entry equals 1
if and only if there is an edge between the corresponding nodes in the graph. Similarly,
weighted random graphs are encoded in square matrices where the entries describe the edge
weights (the weight being 0 in case of no edge between the two nodes). In this context the
partitions on rows and columns of the square matrix are further constrained to be identical.

To our knowledge and despite their similarities, LBMs and SBMs have never been ex- 
plored from the same point of view. We aim at presenting a unified framework for studying 
both LBMs and SBMs. We are more precisely interested in the behaviour of the groups 
posterior distribution, given the data. Our goal is to characterize whether it is possible to 
asymptotically recover the actual groups on the rows and columns of the matrix. In other 
words, we establish sufficient conditions for the groups posterior distribution to converge 
(as the size of the data increases) to a Dirac mass located at the actual (random) groups 
configuration. In particular, we highlight some cases where the model assumes symmetries
in the matrix of connection probabilities that prevent a correct recovery of the
groups (see Theorem 1 and following corollaries). Note that the asymptotic framework is
particularly suited in this context as the datasets are often huge.



One of the first occurrences of LBMs appears in the pioneering work of Hartigan (1972)
under the name three partitions. LBMs were later developed as an intuitive extension of
the finite mixture model, to allow for simultaneous clustering of objects and features. Many
different names are used in the literature for such procedures, among which we mention
block clustering, block modelling, biclustering, co-clustering and two-mode clustering. All
of these procedures differ through the type of clusters they consider. LBMs induce a specific
clustering on the data matrix, namely we partition the rows and columns of the data
matrix and the data clusters are restricted to cartesian products of a row cluster and a
column cluster. Frequentist parameter estimation procedures for LBMs have been proposed in
Govaert and Nadif (2003, 2008) for binary data and Govaert and Nadif (2010) for Poisson
random variables. A Bayesian version of the model has been introduced in DeSarbo et al.
(2004) for random variables belonging to the set [0, 1], combined with a Markov chain Monte
Carlo (MCMC) procedure to estimate the model parameters. Moreover, model selection in
a Bayesian setting is performed at the same time as parameter estimation in Wyse and Friel
(2012), who consider two different types of models: a Bernoulli LBM for binary data and
a Gaussian one for continuous observations. All of these parameter estimation procedures
also provide a clustering of the data, based on the groups posterior distribution (at the
estimated parameter value). To our knowledge, there is no result in the literature about
the quality of such clustering procedures nor about convergence of the groups posterior
distribution in LBMs.



SBMs were (re)-discovered many different times in the literature, and introduced at
first in social sciences to study relational data (see for instance Frank and Harary, 1982;
Holland et al., 1983; Snijders and Nowicki, 1997; Daudin et al., 2008). In this context, the
data consists in a random graph over a set of nodes, or equivalently in a square matrix
(the adjacency matrix) whose entries characterize the relation between two nodes. The
nodes are partitioned into latent groups so that the clustering of the rows and columns of
the matrix is now constrained to be identical. Various parameter estimation procedures
have been proposed in this context, from Bayesian strategies (Snijders and Nowicki, 1997;
Nowicki and Snijders, 2001), to variational approximations of the expectation maximization
(EM) algorithm (Daudin et al., 2008; Mariadassou et al., 2010; Picard et al., 2009) or
variational Bayes approaches (Latouche et al., 2012), online procedures (Zanghi et al., 2008,
2010) and direct methods (Ambroise and Matias, 2011). Note that most of these works are
concerned with binary data and only some of the most recent of them deal with weighted
random graphs (Ambroise and Matias, 2011; Mariadassou et al., 2010).

In each of these procedures, a clustering of the graph nodes is performed according to
the groups posterior distribution (at the estimated parameter value). The behaviour of this
posterior distribution for binary SBMs is studied in Celisse et al. (2011). These authors
establish two different results. The first one (Theorem 3.1 in Celisse et al., 2011) states that
at the true parameter value, the groups posterior distribution converges to a Dirac mass at
the actual value of groups configuration (controlling also the corresponding rate of
convergence). This result is valid only at the true parameter value, while the above mentioned
procedures rely on the groups posterior distribution at an estimated value of the parameter
instead of the true one. Note also that this result establishes a convergence under the
conditional distribution of the data, given the actual configuration on the groups. However, as
this convergence is uniform with respect to the actual configuration, the result also holds
under the unconditional distribution of the observations. The second result they obtain
on the convergence of the groups posterior distribution (Proposition 3.8 in Celisse et al.,
2011) is valid at an estimated parameter value, provided this estimator converges at rate
at least $n^{-1}$ to the true value, where $n$ is the number of nodes in the graph (number of
rows and columns in the square data matrix). Note that this latter assumption is not
harmless as it is not established that such an estimator exists, except in a particular setting
(Ambroise and Matias, 2011); see also Gazal et al. (2011) for empirical results. There are
thus many differences between our result (Theorem 1 and following corollaries) and theirs:
we provide a result for any parameter value in the neighborhood of the true value, we work
with non-necessarily binary data and our work encompasses both SBMs and LBMs. We
however mention that the main goal of these authors is different from ours and consists in
establishing the consistency of maximum likelihood and variational estimators in SBMs.

Next, to conclude with the literature concerning SBMs, we shall mention the works of
Bickel and Chen (2009); Choi et al. (2012) and Rohe et al. (2011) on the performances of
clustering procedures in random graphs. Those articles, which are of a different nature from
ours, establish that under some conditions, the fraction of misclassified nodes (resulting from
different algorithmic procedures) converges to zero as the number of nodes increases. These
results only concern the case of binary graphs, while we shall deal both with binary and
weighted graphs, as well as LBMs. Moreover, the works by Bickel and Chen (2009) and
Rohe et al. (2011) are not based on a probabilistic model and only deal with community
detection, that is to say finding a set of highly connected nodes, this task being more
restrictive than block modeling. We also mention that Choi et al. (2012) and Rohe et al.
(2011) both are concerned with an asymptotic setting where the number of groups is allowed
to grow as the root of the network size and the average network degree grows at least
nearly linearly (Rohe et al., 2011) or poly-logarithmically (Choi et al., 2012) in this size.
In Section 5 of the present work, we explore the validity of our results in a similar framework,
by assuming that the numbers of groups remain fixed while the connection probabilities
between groups converge to zero. Finally and most importantly, note that Choi et al.
(2012) propose convergence results in a setup of independent Bernoulli random variables
(viewing the latent groups as parameters instead of random variables), while in our context,
the observed random variables are not independent.

We also want to outline that many different generalizations allowing for overlapping
groups exist, both for LBMs and SBMs. We refer the interested reader to the works of
DeSarbo et al. (2004) for LBMs and Airoldi et al. (2008); Latouche et al. (2011) in the
case of SBMs, as well as the references therein. However in this work, we restrict our
attention to non overlapping groups.



This work is organized as follows. Section 2 describes LBMs and SBMs and introduces
some important concepts such as equivalent group configurations. Section 3 establishes
general and sufficient conditions for the groups posterior probability to converge (with large
probability) to a (mixture of) Dirac mass, located at (the set of configurations equivalent
to) the actual random configuration. In particular, we discuss the cases where it is likely
that groups estimation relying on maximum posterior probabilities might not converge.
Section 4 illustrates our main result, providing a large number of examples where the above
mentioned conditions are satisfied. Finally, in Section 5 we explore the validity of our results
when the connection probabilities between groups converge to zero. This corresponds to
datasets with an asymptotically decreasing density of connections. Some technical proofs
are postponed to Appendix A.



2 Model and notation 

2.1 Model and assumptions 

We observe a matrix $\mathbf{X}_{n,m} := \{X_{ij}\}_{1\le i\le n,\,1\le j\le m}$ of random variables in some space
$\mathcal{X}$, whose distribution is specified through latent groups on the rows and columns of the
matrix.

Let $Q \ge 1$ and $L \ge 1$ denote the number of latent groups respectively on the rows
and columns of the matrix. Consider the probability distributions $\alpha = (\alpha_1,\dots,\alpha_Q)$ on
$\mathcal{Q} = \{1,\dots,Q\}$ and $\beta = (\beta_1,\dots,\beta_L)$ on $\mathcal{L} = \{1,\dots,L\}$, such that

$$\forall q\in\mathcal{Q},\ \forall l\in\mathcal{L},\quad \alpha_q,\beta_l > 0 \quad\text{and}\quad \sum_{q=1}^{Q}\alpha_q = 1,\ \sum_{l=1}^{L}\beta_l = 1.$$

Let $\mathbf{Z}_n := Z_1,\dots,Z_n$ be independent and identically distributed (i.i.d.) random variables,
with distribution $\alpha$ on $\mathcal{Q}$ and $\mathbf{W}_m := W_1,\dots,W_m$ i.i.d. random variables with distribution
$\beta$ on $\mathcal{L}$. Two different cases will be considered in this work:

Latent block model (LBM). In this case, the random variables $\{Z_i\}_{1\le i\le n}$ and $\{W_j\}_{1\le j\le m}$
are independent. We let $\mathcal{I} = \{1,\dots,n\}\times\{1,\dots,m\}$ and $\mu = \alpha^{\otimes n}\otimes\beta^{\otimes m}$ the
distribution of $(\mathbf{Z}_n,\mathbf{W}_m) := (Z_1,\dots,Z_n,W_1,\dots,W_m)$ and set $U_{ij} = (Z_i,W_j)$ for $(i,j)$ in
$\mathcal{I}$. The random vector $(\mathbf{Z}_n,\mathbf{W}_m)$ takes values in the set $\mathcal{U} := \mathcal{Q}^n\times\mathcal{L}^m$ whereas the
$\{U_{ij} := (Z_i,W_j)\}_{(i,j)\in\mathcal{I}}$ are non-independent random variables taking values in the set
$(\mathcal{Q}\times\mathcal{L})^{nm}$.

Stochastic block model (SBM). In this case we have $n = m$, $\mathcal{Q} = \mathcal{L}$, $Z_i = W_i$ for all
$1\le i\le n$ and $\alpha = \beta$. We let $\mathcal{I} = \{1,\dots,n\}^2$, $\mu = \alpha^{\otimes n}$ the distribution of $\mathbf{Z}_n$ and
set $U_{ij} = (Z_i,Z_j)$ for $(i,j)\in\mathcal{I}$. The random variables $\{U_{ij} := (Z_i,Z_j)\}_{(i,j)\in\mathcal{I}}$ are not
independent and take values in the set

$$\mathcal{U} = \{\{(q_i,q_j)\}_{(i,j)\in\mathcal{I}};\ \forall i\in\{1,\dots,n\},\ q_i\in\mathcal{Q}\}.$$

This case corresponds to the observation of a random graph whose adjacency matrix
is given by $\{X_{ij}\}_{1\le i,j\le n}$. As particular cases, we may also consider graphs with no
self-loops in which case $\mathcal{I} = \{1,\dots,n\}^2\setminus\{(i,i);\ 1\le i\le n\}$. We may also consider
undirected random graphs, possibly with no self-loops, by imposing symmetric adjacency
matrices $X_{ij} = X_{ji}$. In this latter case, $\mathcal{I} = \{1\le i<j\le n\}$. In the following,
some formulas are given in full generality, and one should take $\beta_l = 1$ for any $l$ to
obtain corresponding expressions in SBM.

We introduce a matrix of connectivity parameters $\pi = (\pi_{ql})_{(q,l)\in\mathcal{Q}\times\mathcal{L}}$ belonging to some
set of matrices $\Pi_{QL}$ whose coordinates $\pi_{ql}$ belong to some set $\Pi$ (note that $\Pi_{QL}$ may
be different from the product set $\Pi^{QL}$). Now, conditional on the latent variables $\{U_{ij} =
(Z_i,W_j)\}_{(i,j)\in\mathcal{I}}$, the random variables $\{X_{ij}\}_{(i,j)\in\mathcal{I}}$ are assumed to be independent, with a
parametric distribution on each entry depending on the corresponding row and column
groups. More precisely, conditional on $Z_i = q$ and $W_j = l$, the random variable $X_{ij}$ follows
a distribution parameterized by $\pi_{ql}$. We let $f(\cdot;\pi_{ql})$ denote its density with respect to some
underlying measure (either the counting or Lebesgue measure).

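The model may be summarized as follows:

• $(\mathbf{Z}_n,\mathbf{W}_m)$ latent random variables in $\mathcal{U}$ with distribution given by $\mu$,

• $\mathbf{X}_{n,m} = \{X_{ij}\}_{(i,j)\in\mathcal{I}}$ observations in $\mathcal{X}^{\mathcal{I}}$,

• $\mathbb{P}(\mathbf{X}_{n,m}\mid\mathbf{Z}_n,\mathbf{W}_m) = \otimes_{(i,j)\in\mathcal{I}}\,\mathbb{P}(X_{ij}\mid Z_i,W_j)$,

• $\forall (i,j)\in\mathcal{I}$ and $\forall (q,l)\in\mathcal{Q}\times\mathcal{L}$, we have $X_{ij}\mid (Z_i,W_j) = (q,l) \sim f(\cdot;\pi_{ql})$. (1)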

We consider the following parameter set

$$\Theta = \{\theta = (\mu,\pi);\ \pi\in\Pi_{QL} \text{ and } \forall (q,l)\in\mathcal{Q}\times\mathcal{L},\ \alpha_q\ge\alpha_{\min}>0,\ \beta_l\ge\beta_{\min}>0\},$$

and define $\alpha_{\max} = \max\{\alpha_q;\ q\in\mathcal{Q};\ \theta = (\mu,\pi)\in\Theta\}$ and similarly $\beta_{\max} = \max\{\beta_l;\ l\in\mathcal{L};\ \theta =
(\mu,\pi)\in\Theta\}$. We let $\mu_{\min} := \alpha_{\min}\wedge\beta_{\min}$ and $\mu_{\max} := \alpha_{\max}\vee\beta_{\max}$. We denote by $\mathbb{P}_\theta$ and
$\mathbb{E}_\theta$ the probability distribution and expectation under parameter $\theta$. In the following, we
assume that the observations $\mathbf{X}_{n,m}$ are drawn under the true parameter value $\theta^*\in\Theta$. We
let $\mathbb{P}_*$ and $\mathbb{E}_*$ respectively denote probability and expectation under parameter value $\theta^*$.
We now introduce a necessary condition for the connectivity parameters to be identifiable
from $\mathbb{P}_\theta$.

Assumption 1. i) The parameter $\pi\in\Pi$ is identifiable from the distribution $f(\cdot;\pi)$,
namely $f(\cdot;\pi) = f(\cdot;\pi') \Rightarrow \pi = \pi'$,

ii) For all $q\neq q'\in\mathcal{Q}$, there exists some $l\in\mathcal{L}$ such that $\pi_{ql}\neq\pi_{q'l}$. Similarly, for all
$l\neq l'\in\mathcal{L}$, there exists some $q\in\mathcal{Q}$ such that $\pi_{ql}\neq\pi_{ql'}$.



Assumption 1 will be in force throughout this work. Note that it is a very natural
assumption. In particular, i) will be satisfied by any reasonable family of distributions and
if ii) is not satisfied, there exist for instance two row groups $q\neq q'$ with the same behavior.
These groups (and thus the corresponding parameters) may then not be distinguished
relying on the marginal distribution of $\mathbb{P}_\theta$ on the observation space $\mathcal{X}^{\mathcal{I}}$. Note also that
Assumption 1 is in general not sufficient to ensure identifiability of the parameters in
LBM or SBM. Identifiability results for SBM have first been given in a particular case
in Allman et al. (2009) and then later more thoroughly discussed in Allman et al. (2011)
for undirected, binary or weighted random graphs. See also Celisse et al. (2011) for the
case of directed and binary random graphs.

In the following, for any subset $A$ we denote by either $1_A$ or $1\{A\}$ the indicator function
of event $A$, by $|A|$ its cardinality and by $\bar{A}$ the complementary subset (in the ambient set).

2.2 Equivalent configurations 

We introduce a concept that will enable us to deal with possible symmetries in the parameter
matrices $\pi$. For instance, the affiliation model (Frank and Harary, 1982; Ambroise and Matias,
2011) is a particular case of SBM where the parameter matrix $\pi$ writes $\pi = (\lambda-\nu)I_Q +
\nu\mathbf{1}_Q\mathbf{1}_Q^{\top}$, with $I_Q$ the $Q\times Q$ identity matrix and $\mathbf{1}_Q$ the $Q$-length vector of 1s. In other
words, the model (that is motivated by parsimony) is characterized by only two different
types of connections: inner-group connections all happen with the same probability
$\lambda$, whereas outer-group connections happen with probability $\nu$. In this case, for any
permutation $s$ of $\mathcal{Q}$, the permuted matrix $(\pi_{s(q)s(l)})_{1\le q,l\le Q}$ is equal to the original $\pi$. As a
consequence, we have

$$\mathbf{X}_{n,n}\mid\mathbf{Z}_n \stackrel{d}{=} \mathbf{X}_{n,n}\mid s(\mathbf{Z}_n),\quad\text{under parameter value } \pi,$$

where $\stackrel{d}{=}$ means equality in distribution and $s(\mathbf{Z}_n) := (s(Z_1),\dots,s(Z_n))$. Thus, a posteriori
estimation of the groups distinguishes the different configurations $\{s(\mathbf{Z}_n),\ s\in\mathfrak{S}_Q\}$ (where
we let $\mathfrak{S}_Q$ be the set of permutations of $\mathcal{Q}$), if and only if they happen to have different
probabilities of occurrence. In this latter case, a posteriori estimation will select among
the set $\{s(\mathbf{Z}_n),\ s\in\mathfrak{S}_Q\}$ the configuration whose prior probability is the highest.

More generally for LBMs or SBMs, the quality of a posteriori estimation of the row and
column groups depends on whether there exist some permutations (apart from the trivial
identity permutation) that leave the parameter matrix $\pi$ invariant. If this is the case, then
for instance a model with equal group proportions will recover with equal probability any
of the configurations obtained by permuting the actual one. It should be stressed that this
phenomenon is different from the classical label switching issue that arises in finite mixture
models. LBMs and SBMs also experience the label switching issue: any permutation of
the labels of the row and column groups induces the same distribution on the data
matrix but with rows and columns of $\pi$ permuted accordingly. Here, we rather point out
the fact that for some constrained models, there might exist permutations on the row and
column groups that leave the connectivity parameter $\pi$ invariant. As a consequence, when
comparing the actual group configuration and its permuted version, the posterior distribution
does not rely on the data anymore. Indeed, the difference between the logarithms of those
posterior probabilities is equal to the difference between the logarithms of their prior
probabilities.
We let $\mathfrak{S}_Q$ and $\mathfrak{S}_L$ be the sets of permutations of $\mathcal{Q}$ and $\mathcal{L}$ respectively. For any
$(s,t)\in\mathfrak{S}_Q\times\mathfrak{S}_L$, we let

$$\pi^{s,t} := (\pi^{s,t}_{ql})_{(q,l)\in\mathcal{Q}\times\mathcal{L}} := (\pi_{s(q)t(l)})_{(q,l)\in\mathcal{Q}\times\mathcal{L}}.$$

Fix a subgroup $\mathfrak{S}$ of $\mathfrak{S}_Q\times\mathfrak{S}_L$ and a parameter set $\Pi_{QL}$. Whenever for any pair of
permutations $(s,t)\in\mathfrak{S}$ and any parameter $\pi\in\Pi_{QL}$ we have $\pi^{s,t} = \pi$, we say that the
parameter set $\Pi_{QL}$ is invariant under the action of $\mathfrak{S}$. In the following, we will consider
parameter sets that are invariant under some subgroup $\mathfrak{S}$ of $\mathfrak{S}_Q\times\mathfrak{S}_L$. This includes the
case where $\mathfrak{S}$ is reduced to the identity pair and the parameter set is the unconstrained product
set $\Pi^{QL}$. We will moreover exclude from the parameter set $\Pi_{QL}$ any point $\pi$ admitting
specific symmetries, namely such that there exists some pair $(s,t)\in(\mathfrak{S}_Q\times\mathfrak{S}_L)\setminus\mathfrak{S}$ satisfying
$\pi^{s,t} = \pi$. Note that this corresponds to excluding a subset of null Lebesgue measure from
the parameter set $\Pi_{QL}$.

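Assumption 2. The parameter set $\Pi_{QL}$ is invariant under the action of some (maximal)
subgroup $\mathfrak{S}$ of $\mathfrak{S}_Q\times\mathfrak{S}_L$. Moreover, for any pair of permutations $(s,t)\in(\mathfrak{S}_Q\times\mathfrak{S}_L)\setminus\mathfrak{S}$
and any parameter $\pi\in\Pi_{QL}$, we assume that $\pi^{s,t}\neq\pi$.

Example 1. In SBM, we consider $\mathfrak{S} = \{(\mathrm{Id},\mathrm{Id})\}$ and let

$$\Pi_{QL} = \{\pi\in\Pi^{Q^2};\ \forall (s,t)\in\mathfrak{S}_Q\times\mathfrak{S}_L,\ (s,t)\neq(\mathrm{Id},\mathrm{Id}),\ \text{we have } \pi^{s,t}\neq\pi\}.$$

Example 2. In SBM, we consider $\mathfrak{S} = \{(s,s);\ s\in\mathfrak{S}_Q\}$ and let $\Pi_{QL} = \{(\lambda-\nu)I_Q +
\nu\mathbf{1}_Q\mathbf{1}_Q^{\top};\ \lambda,\nu\in(0,1),\ \lambda\neq\nu\}$.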

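Whenever $\mathfrak{S}$ is not reduced to the identity singleton pair, each parameter value $\pi\in\Pi_{QL}$
induces many different equivalent configurations. More precisely, for any $(s,t)\in\mathfrak{S}$ and any
$\pi\in\Pi_{QL}$, we have

$$\mathbf{X}_{n,m}\mid\{\mathbf{Z}_n,\mathbf{W}_m\} \stackrel{d}{=} \mathbf{X}_{n,m}\mid\{s(\mathbf{Z}_n),t(\mathbf{W}_m)\},\quad\text{under parameter value } \pi,$$

which means that the difference between the log posterior probabilities
$\log\mathbb{P}_\theta(\{\mathbf{Z}_n,\mathbf{W}_m\}\mid\mathbf{X}_{n,m}) - \log\mathbb{P}_\theta(\{s(\mathbf{Z}_n),t(\mathbf{W}_m)\}\mid\mathbf{X}_{n,m})$
does not depend on the data $\mathbf{X}_{n,m}$.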
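
As an illustrative sketch (not part of the original development), the subgroup of permutation pairs leaving a small connectivity matrix invariant can be found by brute force. The helper names `affiliation_matrix`, `permute` and `invariance_group` are hypothetical, groups are labelled from 0 rather than 1, and the enumeration over all pairs of permutations is only feasible for very small $Q$ and $L$.

```python
from itertools import permutations

import numpy as np


def affiliation_matrix(Q, lam, nu):
    """Affiliation structure: pi = (lam - nu) * I_Q + nu * 1_Q 1_Q^T."""
    return (lam - nu) * np.eye(Q) + nu * np.ones((Q, Q))


def permute(pi, s, t):
    """Return pi^{s,t}, whose (q, l) entry is pi[s[q], t[l]]."""
    return pi[np.ix_(s, t)]


def invariance_group(pi):
    """All pairs (s, t) of permutations such that pi^{s,t} equals pi."""
    Q, L = pi.shape
    return [
        (s, t)
        for s in permutations(range(Q))
        for t in permutations(range(L))
        if np.allclose(permute(pi, s, t), pi)
    ]


# Affiliation model with Q = 3: only the pairs (s, s) leave pi invariant.
pi_aff = affiliation_matrix(3, lam=0.7, nu=0.2)
group_aff = invariance_group(pi_aff)

# A matrix with pairwise distinct entries: only the identity pair remains.
pi_generic = np.array([[0.1, 0.2], [0.3, 0.4]])
group_generic = invariance_group(pi_generic)
```

For the affiliation matrix the search returns the pairs $(s,s)$ only, in line with Example 2, whereas a matrix with pairwise distinct entries is left invariant by the identity pair alone, as in Example 1.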

Remark 1. As already said, in SBM with affiliation structure, the group of permutations
$(s,s)$ with $s\in\mathfrak{S}_Q$ leaves the parameter set $\Pi_{QL}$ invariant. For more general models, let us
consider $(s,t) = ([q,q'],[l,l'])\in\mathfrak{S}_Q\times\mathfrak{S}_L$ where $[q,q']$ is the transposition of $q$ and $q'$ in $\mathcal{Q}$
and $[l,l']$ is the transposition of $l$ and $l'$ in $\mathcal{L}$. If this pair leaves $\Pi_{QL}$ invariant, then any
$\pi\in\Pi_{QL}$ satisfies

$$\forall i\in\mathcal{Q}\setminus\{q,q'\},\quad \pi_{il} = \pi_{il'},$$
$$\forall j\in\mathcal{L}\setminus\{l,l'\},\quad \pi_{qj} = \pi_{q'j},$$
$$\pi_{ql} = \pi_{q'l'} \quad\text{and}\quad \pi_{ql'} = \pi_{q'l}.$$

In particular, for Assumption 2 to be satisfied while $([q,q'],[l,l'])$ belongs to $\mathfrak{S}$ that leaves
$\Pi_{QL}$ invariant, it is necessary that both $\pi_{ql}\neq\pi_{ql'}$ and $\pi_{q'l}\neq\pi_{q'l'}$.

Note that the parameter sets $\Pi_{QL}$ that we consider are then in a one-to-one correspondence
with the subgroups $\mathfrak{S}$ of $\mathfrak{S}_Q\times\mathfrak{S}_L$. Note also that we have $|\mathfrak{S}|\le Q!\,L!$ in general
and $|\mathfrak{S}|\le Q!$ in the particular SBM.

We now define equivalent configurations in $\mathcal{U}$.

Definition 1. Consider a parameter set $\Pi_{QL}$ invariant under the action of some subgroup
$\mathfrak{S}$ of $\mathfrak{S}_Q\times\mathfrak{S}_L$ and fix a parameter value $\pi\in\Pi_{QL}$. Any two groups configurations
$(\mathbf{z}_n,\mathbf{w}_m) := (z_1,\dots,z_n,w_1,\dots,w_m)$ and $(\mathbf{z}'_n,\mathbf{w}'_m) := (z'_1,\dots,z'_n,w'_1,\dots,w'_m)$ in $\mathcal{U}$ are
called equivalent (a relation denoted by $(\mathbf{z}_n,\mathbf{w}_m)\sim(\mathbf{z}'_n,\mathbf{w}'_m)$) if and only if there exists
$(s,t)\in\mathfrak{S}$ such that

$$(s(\mathbf{z}'_n),t(\mathbf{w}'_m)) := (s(z'_1),\dots,s(z'_n),t(w'_1),\dots,t(w'_m)) = (\mathbf{z}_n,\mathbf{w}_m).$$

We let $\tilde{\mathcal{U}}$ denote the quotient of $\mathcal{U}$ by this equivalence relation. Note in particular that if
$(\mathbf{z}_n,\mathbf{w}_m)\sim(\mathbf{z}'_n,\mathbf{w}'_m)$ then for any $\pi\in\Pi_{QL}$, we have $(\pi_{z_iw_j})_{(i,j)\in\mathcal{I}} = (\pi_{z'_iw'_j})_{(i,j)\in\mathcal{I}}$.

For any vector $u = (u_1,\dots,u_p)\in\mathbb{R}^p$, we let $\|u\|_0 := \sum_{i=1}^{p} 1\{u_i\neq 0\}$. The distance
between two different configurations $(\mathbf{z}_n,\mathbf{w}_m)\in\tilde{\mathcal{U}}$ and $(\mathbf{z}'_n,\mathbf{w}'_m)\in\tilde{\mathcal{U}}$ is measured via the
minimum $\|\cdot\|_0$ distance between any two representatives of these classes. We thus let

$$d((\mathbf{z}_n,\mathbf{w}_m),(\mathbf{z}'_n,\mathbf{w}'_m)) := \min\{\|\mathbf{z}_n - s(\mathbf{z}'_n)\|_0 + \|\mathbf{w}_m - t(\mathbf{w}'_m)\|_0;\ (s,t)\in\mathfrak{S}\}. \qquad (2)$$

Note that this distance is well-defined on the space $\tilde{\mathcal{U}}$. Note also that when $\mathfrak{S}$ is reduced
to the identity pair, the distance $d(\cdot,\cdot)$ is an ordinary $\ell_0$ distance.
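
The distance (2) can be computed directly when the invariance subgroup is small enough to be listed. The following is a minimal sketch under assumptions that are not in the text: groups are labelled from 0, a permutation is stored as a tuple `s` with `s[q]` the image of `q`, and the function names `l0` and `config_distance` are hypothetical.

```python
def l0(u, v):
    """Number of coordinates at which the label vectors u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def config_distance(z, w, z2, w2, group):
    """Distance (2): minimum over (s, t) in `group` of
    ||z - s(z2)||_0 + ||w - t(w2)||_0."""
    return min(
        l0(z, [s[q] for q in z2]) + l0(w, [t[l] for l in w2])
        for s, t in group
    )


ident, swap = (0, 1), (1, 0)

# Two configurations with n = 3 rows and m = 2 columns, Q = L = 2.
z, w = (0, 0, 1), (0, 1)
z2, w2 = (1, 1, 0), (1, 1)

# Subgroup reduced to the identity pair: ordinary l0 distance.
d_trivial = config_distance(z, w, z2, w2, [(ident, ident)])

# Subgroup that also contains the simultaneous swap of both label sets.
d_symmetric = config_distance(z, w, z2, w2, [(ident, ident), (swap, swap)])
```

With the trivial subgroup the two configurations differ at four positions, while allowing the simultaneous swap brings them within distance one of each other: relabelling matters only through the permutations contained in the subgroup.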

2.3 Most likely configurations 

Among the set of all (up to equivalence) configurations $\tilde{\mathcal{U}}$, we shall distinguish some which
are well-behaved in the following sense. For any groups $q\in\mathcal{Q}$ and $l\in\mathcal{L}$, consider the
events

$$A_q = \Big\{\omega\in\Omega;\ N_q(\mathbf{Z}_n(\omega)) := \sum_{i=1}^{n} 1\{Z_i(\omega) = q\} < n\mu_{\min}/2\Big\},$$

$$\text{and}\quad B_l = \Big\{\omega\in\Omega;\ N_l(\mathbf{W}_m(\omega)) := \sum_{j=1}^{m} 1\{W_j(\omega) = l\} < m\mu_{\min}/2\Big\}.$$

Since $N_q(\mathbf{Z}_n)$ and $N_l(\mathbf{W}_m)$ are sums of i.i.d. Bernoulli random variables with respective
parameters $\alpha^*_q$ and $\beta^*_l$, satisfying $\alpha^*_q\wedge\beta^*_l\ge\mu_{\min}$, a standard Hoeffding inequality gives

$$\mathbb{P}_*(A_q\cup B_l) \le \exp[-n(\alpha^*_q)^2/2] + \exp[-m(\beta^*_l)^2/2] \le 2\exp[-(n\wedge m)\mu_{\min}^2/2].$$

Taking a union bound, we obtain $\mathbb{P}_*(\cup_{(q,l)\in\mathcal{Q}\times\mathcal{L}}(A_q\cup B_l)) \le 2QL\exp[-(n\wedge m)\mu_{\min}^2/2]$.
Now, consider the event $\Omega_0$ defined by

$$\Omega_0 := \{\omega\in\Omega;\ \forall (q,l)\in\mathcal{Q}\times\mathcal{L},\ N_q(\mathbf{Z}_n(\omega))\ge n\mu_{\min}/2 \text{ and } N_l(\mathbf{W}_m(\omega))\ge m\mu_{\min}/2\}
= \cap_{(q,l)\in\mathcal{Q}\times\mathcal{L}}(\bar{A}_q\cap\bar{B}_l), \qquad (3)$$

which has $\mathbb{P}_*$-probability larger than $1 - 2QL\exp[-(n\wedge m)\mu_{\min}^2/2]$ and its counterpart $\mathcal{U}^0$
defined by

$$\mathcal{U}^0 = \{(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U};\ \forall (q,l)\in\mathcal{Q}\times\mathcal{L},\ N_q(\mathbf{z}_n)\ge n\mu_{\min}/2 \text{ and } N_l(\mathbf{w}_m)\ge m\mu_{\min}/2\}, \qquad (4)$$

where $N_q(\mathbf{z}_n) := \sum_{i=1}^{n} 1\{z_i = q\}$ and $N_l(\mathbf{w}_m)$ is defined similarly. We extend this notation
up to equivalent configurations, by letting $\tilde{\mathcal{U}}^0$ be the set of configurations $(\mathbf{z}_n,\mathbf{w}_m)\in\tilde{\mathcal{U}}$
such that at least one (and then in fact all) representative in the class belongs to $\mathcal{U}^0$.
Note that neither $N_q(\mathbf{z}_n)$ nor $N_l(\mathbf{w}_m)$ are properly defined on $\tilde{\mathcal{U}}$, as these quantities may
take different values for equivalent configurations. However, as soon as one representative
$(\mathbf{z}_n,\mathbf{w}_m)$ belongs to $\mathcal{U}^0$, we both get $N_q(\mathbf{z}'_n)\ge n\mu_{\min}/2$ and $N_l(\mathbf{w}'_m)\ge m\mu_{\min}/2$ for any
$(\mathbf{z}'_n,\mathbf{w}'_m)\sim(\mathbf{z}_n,\mathbf{w}_m)$. In the following, some properties will only be valid on the set of
configurations $\tilde{\mathcal{U}}^0$.
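
To get a feeling for the bound on $\mathbb{P}_*(\Omega_0)$, the sketch below evaluates $1 - 2QL\exp[-(n\wedge m)\mu_{\min}^2/2]$ and compares it with a Monte Carlo frequency of the event $\Omega_0$. The group proportions, sample sizes and function names are illustrative assumptions, not values taken from the text; the group counts are drawn as multinomial vectors, which is the joint law of the counts of i.i.d. labels.

```python
import math

import numpy as np


def omega0_bound(n, m, Q, L, mu_min):
    """Lower bound 1 - 2 Q L exp(-(n ^ m) mu_min^2 / 2) on P*(Omega_0)."""
    return 1.0 - 2 * Q * L * math.exp(-min(n, m) * mu_min**2 / 2)


def omega0_frequency(n, m, alpha, beta, mu_min, n_rep=2000, seed=0):
    """Monte Carlo frequency of the event Omega_0 defined in (3)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        row_counts = rng.multinomial(n, alpha)  # (N_q(Z_n))_q
        col_counts = rng.multinomial(m, beta)   # (N_l(W_m))_l
        rows_ok = row_counts.min() >= n * mu_min / 2
        cols_ok = col_counts.min() >= m * mu_min / 2
        hits += int(rows_ok and cols_ok)
    return hits / n_rep


# Hypothetical setting: Q = L = 2, all proportions at least mu_min = 0.3.
alpha, beta, mu_min = [0.5, 0.5], [0.3, 0.7], 0.3
bound = omega0_bound(200, 200, 2, 2, mu_min)
freq = omega0_frequency(200, 200, alpha, beta, mu_min)
small_sample_bound = omega0_bound(10, 10, 2, 2, mu_min)
```

The bound is informative only once $(n\wedge m)\mu_{\min}^2$ is large: for $n = m = 10$ it is negative and hence vacuous, whereas for $n = m = 200$ it is already very close to one.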



3 Groups posterior distribution 

3.1 The groups posterior distribution 

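We provide a preliminary lemma on the expression of the groups posterior distribution.

Lemma 1. For any $n,m\ge 1$ and any $\theta\in\Theta$, the groups posterior distribution writes for
any $(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U}$,

$$p^{\theta}_{n,m}(\mathbf{z}_n,\mathbf{w}_m) := \mathbb{P}_\theta((\mathbf{Z}_n,\mathbf{W}_m) = (\mathbf{z}_n,\mathbf{w}_m)\mid\mathbf{X}_{n,m})
\propto \Big(\prod_{(i,j)\in\mathcal{I}} f(X_{ij};\pi_{z_iw_j})\Big)\Big(\prod_{i=1}^{n}\alpha_{z_i}\Big)\Big(\prod_{j=1}^{m}\beta_{w_j}\Big), \qquad (5)$$

where $\propto$ means equality up to a normalizing constant and where we let $\beta_l = 1$ in SBM.
The proof of this lemma is straightforward and therefore omitted.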
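
For very small matrices, expression (5) can be normalized by exhaustive enumeration of $\mathcal{U}$. The sketch below does so for Bernoulli entries, $f(x;\pi) = \pi^x(1-\pi)^{1-x}$, which is one possible choice of family and an assumption of this example; the function names are hypothetical and groups are labelled from 0. The enumeration has $Q^nL^m$ terms, so it is meant as an illustration only, not as a practical inference method.

```python
from itertools import product

import numpy as np


def log_posterior_unnorm(X, z, w, alpha, beta, pi):
    """Log of the right-hand side of (5) for Bernoulli entries."""
    p = pi[np.ix_(z, w)]  # p[i, j] = pi[z_i, w_j]
    loglik = np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))
    logprior = np.log(alpha[list(z)]).sum() + np.log(beta[list(w)]).sum()
    return loglik + logprior


def groups_posterior(X, alpha, beta, pi):
    """Posterior over all configurations (z, w), by exhaustive enumeration."""
    X, alpha, beta, pi = (np.asarray(a) for a in (X, alpha, beta, pi))
    n, m = X.shape
    Q, L = pi.shape
    configs = [
        (z, w)
        for z in product(range(Q), repeat=n)
        for w in product(range(L), repeat=m)
    ]
    logp = np.array(
        [log_posterior_unnorm(X, z, w, alpha, beta, pi) for z, w in configs]
    )
    prob = np.exp(logp - logp.max())  # subtract the max for numerical stability
    prob /= prob.sum()
    return dict(zip(configs, prob))


# Hypothetical 3 x 3 binary matrix, equal proportions, symmetric pi.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(3, 3))
pi_sym = np.array([[0.8, 0.2], [0.2, 0.8]])
post = groups_posterior(X, [0.5, 0.5], [0.5, 0.5], pi_sym)
```

With this symmetric $\pi$ and equal group proportions, swapping both label sets simultaneously leaves the posterior probability unchanged, which illustrates the equivalent configurations of Section 2.2.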

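In the following, we will consider the main term in the log ratio $\log p^{\theta}_{n,m}(\mathbf{z}^*_n,\mathbf{w}^*_m) -
\log p^{\theta}_{n,m}(\mathbf{z}_n,\mathbf{w}_m)$ for two different configurations $(\mathbf{z}^*_n,\mathbf{w}^*_m),(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U}$. More precisely,
we introduce

$$\forall (\mathbf{z}^*_n,\mathbf{w}^*_m),(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U},\quad \delta^{\pi}(\mathbf{z}^*_n,\mathbf{w}^*_m,\mathbf{z}_n,\mathbf{w}_m) = \sum_{(i,j)\in\mathcal{I}} \log\left(\frac{f(X_{ij};\pi_{z^*_iw^*_j})}{f(X_{ij};\pi_{z_iw_j})}\right). \qquad (6)$$

Note that this quantity is well-defined on $\tilde{\mathcal{U}}\times\tilde{\mathcal{U}}$. We also consider its expectation, under
true parameter value $\theta^*$ and conditional on the event $(\mathbf{Z}_n,\mathbf{W}_m) = (\mathbf{z}^*_n,\mathbf{w}^*_m)$; namely for
any $(\mathbf{z}^*_n,\mathbf{w}^*_m)$ and $(\mathbf{z}_n,\mathbf{w}_m)\in\tilde{\mathcal{U}}$, we let

$$\Delta^{\pi}(\mathbf{z}^*_n,\mathbf{w}^*_m,\mathbf{z}_n,\mathbf{w}_m) = \sum_{(i,j)\in\mathcal{I}} \mathbb{E}_*\left(\log\left(\frac{f(X_{ij};\pi_{z^*_iw^*_j})}{f(X_{ij};\pi_{z_iw_j})}\right)\,\Big|\,(\mathbf{Z}_n,\mathbf{W}_m) = (\mathbf{z}^*_n,\mathbf{w}^*_m)\right). \qquad (7)$$

Probabilities and expectations conditional on $(\mathbf{Z}_n,\mathbf{W}_m) = (\mathbf{z}^*_n,\mathbf{w}^*_m)$ and under parameter
value $\theta^*$ will be denoted by $\mathbb{P}_*^{\mathbf{z}^*_n\mathbf{w}^*_m}$ and $\mathbb{E}_*^{\mathbf{z}^*_n\mathbf{w}^*_m}$, respectively.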



3.2 Assumptions on the model 

The results of this section are valid as long as the family of distributions $\{f(\cdot;\pi);\ \pi\in\Pi\}$
satisfies some properties. We thus formulate these as assumptions in this general section, and
establish later that these assumptions are satisfied in each particular case to be considered.

The first of these assumptions is a (conditional on the configuration) concentration
inequality on the random variable $\delta^{\pi}(\mathbf{Z}_n,\mathbf{W}_m,\mathbf{z}_n,\mathbf{w}_m)$ around its conditional expectation.
We only require it to be valid for configurations $(\mathbf{Z}_n,\mathbf{W}_m) = (\mathbf{z}^*_n,\mathbf{w}^*_m)\in\mathcal{U}^0$. Note that
under conditional probability $\mathbb{P}_*^{\mathbf{z}^*_n\mathbf{w}^*_m}$, the random variables $\{X_{ij};\ (i,j)\in\mathcal{I}\}$ are independent.
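Assumption 3. (Concentration inequality). Fix $(\mathbf{z}^*_n,\mathbf{w}^*_m)\in\mathcal{U}^0$ and $(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U}$ such
that $(\mathbf{z}_n,\mathbf{w}_m)\neq(\mathbf{z}^*_n,\mathbf{w}^*_m)$. There exists some positive function $\psi^* : (0,+\infty)\to(0,+\infty]$
such that for any $\varepsilon > 0$, we have

$$\mathbb{P}_*^{\mathbf{z}^*_n\mathbf{w}^*_m}\Big(\big|\delta^{\pi}(\mathbf{z}^*_n,\mathbf{w}^*_m,\mathbf{z}_n,\mathbf{w}_m) - \mathbb{E}_*^{\mathbf{z}^*_n\mathbf{w}^*_m}\big(\delta^{\pi}(\mathbf{z}^*_n,\mathbf{w}^*_m,\mathbf{z}_n,\mathbf{w}_m)\big)\big| > \varepsilon(mr_1 + nr_2)\Big)
\le 2\exp[-\psi^*(\varepsilon)(mr_1 + nr_2)], \qquad (8)$$

where the distance $d((\mathbf{z}^*_n,\mathbf{w}^*_m),(\mathbf{z}_n,\mathbf{w}_m))$ defined by (2) is attained for some pair of
permutations $(s,t)\in\mathfrak{S}$ and we set $r_1 := \|\mathbf{z}^*_n - s(\mathbf{z}_n)\|_0$ and $r_2 := \|\mathbf{w}^*_m - t(\mathbf{w}_m)\|_0$.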

Remark 2. Assumption 3 is reasonable and is often obtained by an exponential control of
the centered random variable 

r "' = los 7(^- E 'l lo H7(^j 

uniformly in $\pi,\pi'\in\Pi$. As shown in Section 4, as soon as 

^ma X (A) := sup K 7T (exp(XY w y)) 

7r,7r'GlI 

is finite for $\lambda$ in a small open interval $I\subset\mathbb{R}$ around 0, a Cramér-Chernoff bound shows
that Equation (8) is satisfied with 

^*( £ ) := ^|in sup(Ae -^ max (A)). 
° AG/ 

The second assumption needed is a bound on the Kullback-Leibler divergences for elements
of the family $\{f(\cdot;\pi);\ \pi\in\Pi\}$. We let

$$D(\pi\|\pi') := \int_{\mathcal{X}} \log\left(\frac{f(x;\pi)}{f(x;\pi')}\right) f(x;\pi)\,dx. \qquad (9)$$

Assumption 4. (Bounds on Kullback-Leibler divergences). We assume that

$$K_{\max} := \max\{D(\pi\|\pi');\ \pi,\pi'\in\Pi\} < +\infty.$$

Note that $K_{\max} < +\infty$ is automatically satisfied when the distributions in the family
$\{f(\cdot;\pi);\ \pi\in\Pi\}$ have same support. In particular, this is not the case for Bernoulli
distributions when we authorize some probabilities $\pi$ to be 0 or 1. In the following, we thus
exclude the possibility that classes may be almost never or almost surely connected. We
also introduce

$$K_{\min} = K_{\min}(\pi^*) := \min\{D(\pi^*_{ql}\|\pi^*_{q'l'});\ (q,l),(q',l')\in\mathcal{Q}\times\mathcal{L},\ \pi^*_{ql}\neq\pi^*_{q'l'}\} > 0, \qquad (10)$$

where positivity is a consequence of Assumption 1. The parameter $K_{\min}$ measures how far
apart the non-identical entries of $\pi^*$ are and is the main driver of the convergence rate of
the posterior distribution.
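
As a worked illustration of definitions (9) and (10), the sketch below evaluates the divergences for Bernoulli distributions, where $D(p\|q) = p\log(p/q) + (1-p)\log((1-p)/(1-q))$. The Bernoulli family, the matrix `pi_star` and the function names are assumptions made for this example. For a parameter set $\Pi = [a,b]\subset(0,1)$, joint convexity of $D$ implies that the maximum in Assumption 4 is reached at a corner of $[a,b]^2$, which is what `k_max_on_interval` uses.

```python
import math
from itertools import product


def kl_bernoulli(p, q):
    """D(p || q) between Bernoulli(p) and Bernoulli(q), with 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def k_min(pi_star):
    """K_min of (10): smallest divergence between non-identical entries."""
    entries = [x for row in pi_star for x in row]
    return min(
        kl_bernoulli(a, b) for a, b in product(entries, repeat=2) if a != b
    )


def k_max_on_interval(a, b):
    """K_max of Assumption 4 when Pi is the interval [a, b] within (0, 1)."""
    return max(kl_bernoulli(a, b), kl_bernoulli(b, a))


# Hypothetical affiliation-type matrix with lambda = 0.7 and nu = 0.2.
pi_star = [[0.7, 0.2], [0.2, 0.7]]
kmin_value = k_min(pi_star)

# K_max grows without bound as the interval approaches {0, 1}.
kmax_moderate = k_max_on_interval(0.2, 0.8)
kmax_extreme = k_max_on_interval(0.01, 0.99)
```

The last two values show why probabilities equal to 0 or 1 have to be excluded: the closer the endpoints of $\Pi$ get to 0 and 1, the larger $K_{\max}$ becomes.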

The last assumption needed is a Lipschitz condition on an integrated version of the
function $\pi\mapsto\log f(x;\pi)$.

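Assumption 5. There exists some positive constant $L_0$ such that for any $\pi,\pi'\in\Pi_{QL}$ and
any $(q,l),(q',l')\in\mathcal{Q}\times\mathcal{L}$, we have

$$\left|\int_{\mathcal{X}} \log\left(\frac{f(x;\pi_{ql})}{f(x;\pi'_{ql})}\right) f(x;\pi^*_{q'l'})\,dx\right| \le L_0\|\pi-\pi'\|_\infty.$$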
3.3 Convergence of the posterior distribution

We now establish some preliminary results. The first one gives the behavior of the
conditional expectation $\Delta^{\pi}$ defined by (7) with respect to the distance between the two
configurations $(\mathbf{Z}_n,\mathbf{W}_m)$ and $(\mathbf{z}_n,\mathbf{w}_m)$.

Proposition 1. (Behavior of conditional expectation). Under Assumptions 1, 2 and 4, the
constant $C = 2K_{\max} > 0$ is such that for any parameter value $\pi\in\Pi_{QL}$ and any sequence
$(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U}$, we have $\mathbb{P}_*$-almost surely

$$\mathbb{E}_*^{\mathbf{Z}_n\mathbf{W}_m}\big(\delta^{\pi}(\mathbf{Z}_n,\mathbf{W}_m,\mathbf{z}_n,\mathbf{w}_m)\big) \le \frac{C}{2}(mr_1 + nr_2), \qquad (11)$$

where the distance $d((\mathbf{Z}_n,\mathbf{W}_m),(\mathbf{z}_n,\mathbf{w}_m))$ is attained for some $(s,t)\in\mathfrak{S}$ and we set $r_1 :=
\|\mathbf{Z}_n - s(\mathbf{z}_n)\|_0$ and $r_2 := \|\mathbf{W}_m - t(\mathbf{w}_m)\|_0$.

Furthermore, under additional Assumption 5, the constant $c = \mu_{\min}^2K_{\min}/16\in(0,C/4)$
is such that on the set $\Omega_0$ defined by (3), whose $\mathbb{P}_*$-probability satisfies $\mathbb{P}_*(\Omega_0)\ge 1 - 2QL
\exp[-(n\wedge m)\mu_{\min}^2/2]$, for any parameter value $\pi\in\Pi_{QL}$ and any sequence $(\mathbf{z}_n,\mathbf{w}_m)\in\mathcal{U}$,
we have

$$\mathbb{E}_*^{\mathbf{Z}_n\mathbf{W}_m}\big(\delta^{\pi}(\mathbf{Z}_n,\mathbf{W}_m,\mathbf{z}_n,\mathbf{w}_m)\big) \ge 2(c - L_0\|\pi-\pi^*\|_\infty)(mr_1 + nr_2). \qquad (12)$$
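Proof. Note that

$$E_*^{Z_n W_m}\big(\delta^\pi(Z_n, W_m, z_n, w_m)\big) = \sum_{(z_n^*, w_m^*) \in \mathcal{U}} E_*^{z_n^* w_m^*}\big(\delta^\pi(z_n^*, w_m^*, z_n, w_m)\big) \times 1_{(Z_n, W_m) = (z_n^*, w_m^*)},$$

so that we can work on the set $\{(Z_n, W_m) = (z_n^*, w_m^*)\}$ for a fixed configuration $(z_n^*, w_m^*) \in \mathcal{U}$. Moreover, we can choose $(z_n, w_m) \in \mathcal{U}$ that realizes the distance $d((z_n^*, w_m^*), (z_n, w_m))$, namely such that $d((z_n^*, w_m^*), (z_n, w_m)) = \|z_n^* - z_n\|_0 + \|w_m^* - w_m\|_0 = r_1 + r_2$.

If $(z_n, w_m) = (z_n^*, w_m^*)$, namely $r_1 = r_2 = 0$, then we have $\delta^\pi(z_n^*, w_m^*, z_n, w_m) = 0$ and the proposition is proved. Otherwise, we may have $r_1$ or $r_2$ equal to zero but $r_1 + r_2 \ge 1$. Without loss of generality (w.l.o.g.), we can assume that $z_n^*, z_n$ (respectively $w_m^*, w_m$) differ at the first $r_1$ (resp. $r_2$) indexes.

First, let us note that

$$E_*^{z_n^* w_m^*}\big(\delta^\pi(z_n^*, w_m^*, z_n, w_m)\big) = \sum_{(i,j) \in \mathcal{I}} \int_{\mathcal{X}} \log\left( \frac{f(x;\pi_{z_i^* w_j^*})}{f(x;\pi_{z_i w_j})} \right) f(x;\pi^*_{z_i^* w_j^*})\,dx, \tag{13}$$

where $\mathcal{I} = \{1,\dots,n\} \times \{1,\dots,m\} \setminus \{(i,j);\ i > r_1 \text{ and } j > r_2\}$. This leads to

$$E_*^{z_n^* w_m^*}\big(\delta^\pi(z_n^*, w_m^*, z_n, w_m)\big) \le (m r_1 + n r_2 - r_1 r_2)\,\kappa_{\max} \le \frac{C}{2}(m r_1 + n r_2),$$

with $C = 2\kappa_{\max}$, which establishes Inequality (11).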

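To prove Inequality (12), we write the decomposition

$$E_*^{z_n^* w_m^*}\big(\delta^\pi(z_n^*, w_m^*, z_n, w_m)\big) = \sum_{(i,j) \in \mathcal{I}} \left\{ \int_{\mathcal{X}} \log\left( \frac{f(x;\pi_{z_i^* w_j^*})}{f(x;\pi^*_{z_i^* w_j^*})} \right) f(x;\pi^*_{z_i^* w_j^*})\,dx + D(\pi^*_{z_i^* w_j^*} \,\|\, \pi^*_{z_i w_j}) + \int_{\mathcal{X}} \log\left( \frac{f(x;\pi^*_{z_i w_j})}{f(x;\pi_{z_i w_j})} \right) f(x;\pi^*_{z_i^* w_j^*})\,dx \right\}. \tag{14}$$

According to Assumption 5, the third term in the right-hand side of the above equation is lower-bounded by $-L_0\|\pi - \pi^*\|_\infty (m r_1 + n r_2 - r_1 r_2)$. The first term in this right-hand side is handled similarly as we have

$$\left| \sum_{(i,j) \in \mathcal{I}} \int_{\mathcal{X}} \log\left( \frac{f(x;\pi_{z_i^* w_j^*})}{f(x;\pi^*_{z_i^* w_j^*})} \right) f(x;\pi^*_{z_i^* w_j^*})\,dx \right| \le \sum_{(i,j) \in \mathcal{I}} \left| \int_{\mathcal{X}} \log\left( \frac{f(x;\pi_{z_i^* w_j^*})}{f(x;\pi^*_{z_i^* w_j^*})} \right) f(x;\pi^*_{z_i^* w_j^*})\,dx \right| \le L_0 \|\pi - \pi^*\|_\infty (m r_1 + n r_2 - r_1 r_2),$$

where the second inequality is another application of Assumption 5.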

The central term appearing in the right-hand side of decomposition (14) is handled relying on the next lemma, whose proof is postponed to the Appendix. It is a generalization to LBM of Proposition B.5 in Celisse et al. (2011), which considers SBM only. This lemma bounds from below the number of pairs $(i,j) \in \mathcal{I}$ such that

$$\pi^*_{z_i^* w_j^*} \ne \pi^*_{z_i w_j}$$

and establishes that it is of order $m r_1 + n r_2$. This is possible only for the configurations $(z_n^*, w_m^*) \in \mathcal{U}^0$ (defined earlier). For the rest of the proof, we work on the set $\Omega_0$, meaning that we assume $\{(Z_n, W_m) = (z_n^*, w_m^*) \in \mathcal{U}^0\}$.

Lemma 2. (Bound on the number of differences). Under Assumptions [7] and{^ for any 
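configurations $(z_n, w_m) \in \mathcal{U}$ and $(z_n^*, w_m^*) \in \mathcal{U}^0$, we have

$$\mathrm{diff}(z_n, w_m, z_n^*, w_m^*) := \big|\{(i,j) \in \mathcal{I};\ \pi^*_{z_i w_j} \ne \pi^*_{z_i^* w_j^*}\}\big| \ge \frac{\mu_{\min}^2}{8}(m r_1 + n r_2), \tag{15}$$

where the distance $d((z_n, w_m), (z_n^*, w_m^*))$ is attained for some pair of permutations $(s,t) \in \mathfrak{S}$ and we set $r_1 := \|z_n - s(z_n^*)\|_0$ and $r_2 := \|w_m - t(w_m^*)\|_0$.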

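According to Assumption 4, if $\pi^*_{z_i^* w_j^*} \ne \pi^*_{z_i w_j}$, the divergence $D(\pi^*_{z_i^* w_j^*} \,\|\, \pi^*_{z_i w_j})$ is at least $\kappa_{\min}$. We thus get

$$\sum_{(i,j) \in \mathcal{I}} D(\pi^*_{z_i^* w_j^*} \,\|\, \pi^*_{z_i w_j}) \ge \frac{\mu_{\min}^2 \kappa_{\min}}{8}(m r_1 + n r_2).$$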


Coming back to (14) and (13), we obtain

$$\sum_{(i,j) \in \mathcal{I}} \int_{\mathcal{X}} \log\left( \frac{f(x;\pi_{z_i^* w_j^*})}{f(x;\pi_{z_i w_j})} \right) f(x;\pi^*_{z_i^* w_j^*})\,dx \ge \left( \frac{\mu_{\min}^2 \kappa_{\min}}{8} - 2L_0\|\pi - \pi^*\|_\infty \right)(m r_1 + n r_2)$$

and thus conclude

$$E_*^{z_n^* w_m^*}\big(\delta^\pi(z_n^*, w_m^*, z_n, w_m)\big) \ge \left( \frac{\mu_{\min}^2 \kappa_{\min}}{8} - 2L_0\|\pi - \pi^*\|_\infty \right)(m r_1 + n r_2).$$

By letting $c = \mu_{\min}^2 \kappa_{\min}/16$ we obtain exactly (12). We moreover remark that $2c \le C/2$. □

In the following, we will consider asymptotic results where both $n$ and $m$ increase to infinity. The next assumption settles the relative rates of convergence of $n$ and $m$. With no loss of generality, we assume in the following that $n \ge m$, view $m = m_n$ as a sequence depending on $n$ and state the convergence results with respect to $n \to +\infty$.

Assumption 6. (Asymptotic setup). The sequence $(m_n)_{n \ge 1}$ converges to infinity under the constraints $m_n \le n$ and $(\log n)/m_n \to 0$.
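
As a rough numerical illustration of what this growth condition buys: terms of the form $n\exp(-a\,m_n)$ with $a > 0$ fall below $n^{-2}$ for large $n$ as soon as $(\log n)/m_n \to 0$, and are therefore summable. The sketch below assumes the hypothetical choice $m_n = \lceil (\log n)^2 \rceil$ and an arbitrary constant $a = 0.5$; neither value comes from the paper.

```python
import math

def m_of_n(n):
    # Hypothetical column count m_n = ceil((log n)^2): m_n <= n and (log n)/m_n -> 0.
    return math.ceil(math.log(n) ** 2)

def u(n, a=0.5):
    # Generic term n * exp(-a * m_n); a > 0 is an arbitrary illustrative constant.
    return n * math.exp(-a * m_of_n(n))

def below_inverse_square(n_values, a=0.5):
    # True if u_n <= n^{-2} for every n in n_values (a sufficient sign of summability).
    return all(u(n, a) <= n ** -2 for n in n_values)

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, m_of_n(n), u(n))
```

With these assumed values the comparison with $n^{-2}$ only kicks in for moderately large $n$ (a few hundred), which is why the check is run on a range starting away from the origin.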

We now state the main theorem. 

Theorem 1. Under Assumptions 1 to 6, following the notation of Proposition 1, for any $\eta \in (0, c/(2L_0))$, there exists a family $\{\varepsilon_{n,m}\}_{n,m}$ of positive real numbers with $\sum_n \varepsilon_{n,m_n} < +\infty$, such that on a set $\Omega_1$ whose $P_*$-probability is at least $1 - \varepsilon_{n,m}$ and for any $\theta = (\mu, \pi) \in \Theta$ satisfying $\|\pi - \pi^*\|_\infty \le \eta$, we have for any $(z_n, w_m) \in \mathcal{U}$ and any $(s,t) \in \mathfrak{S}$

$$(c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2) - K\big(\|s(Z_n) - z_n\|_0 + \|t(W_m) - w_m\|_0\big) \le \log \frac{p^\theta_{n,m}(s(Z_n), t(W_m))}{p^\theta_{n,m}(z_n, w_m)} \le C(m r_1 + n r_2) + K\big(\|s(Z_n) - z_n\|_0 + \|t(W_m) - w_m\|_0\big), \tag{16}$$

where the distance $d((s(Z_n), t(W_m)), (z_n, w_m))$, which does not depend on $(s,t)$, is attained for some pair of permutations in $\mathfrak{S}$, again denoted $(s,t)$, for which we set $r_1 := \|Z_n - s(z_n)\|_0$ and $r_2 := \|W_m - t(w_m)\|_0$, and $K = \log(\alpha_{\max}/\alpha_{\min}) \vee \log(\beta_{\max}/\beta_{\min})$.

Let us comment on this result. Inequality (16) provides a control of the concentration of the posterior distribution on the actual (random) configuration $(Z_n, W_m)$, viewed as an equivalence class in $\mathcal{U}$. Its most important part is its left-hand side, which provides a lower bound on the posterior probability of any configuration equivalent to the actual configuration $(Z_n, W_m)$ compared to any other configuration $(z_n, w_m)$. In this inequality, two different distances appear between these configurations, namely the $\ell_0$ distance and the distance $d(\cdot,\cdot)$ given by (2), on the set of actual configurations (so that $d(\cdot,\cdot)$ is linked with the parameter $\pi$ and its symmetries). When the subgroup $\mathfrak{S}$ is reduced to the identity pair (no symmetries allowed in $\pi$), these two distances coincide and the statement substantially simplifies. Another case where it simplifies is when $K = 0$, corresponding to $\alpha_{\max} = \alpha_{\min}$ and $\beta_{\max} = \beta_{\min}$ or equivalently to uniform group proportions. These two particular cases are further expanded below in the first two corollaries. In general, the two different distances appear and play a different role in this inequality. In particular, consider Inequality (16) with for instance $s = \mathrm{Id} = t$. It may be the case that a putative configuration $(z_n, w_m)$ is equivalent to the actual random one $(Z_n, W_m)$ in the sense of relation $\sim$, and thus their distance $d(\cdot,\cdot)$ is zero ($r_1 = r_2 = 0$ above), but their $\ell_0$ distance is large. Then, the posterior distribution $p^\theta_{n,m}$ will not concentrate on $(z_n, w_m)$ due to the existence of different group proportions that help distinguish between $(Z_n, W_m)$ and this equivalent configuration $(z_n, w_m)$. The extent to which the group proportions $\mu$ are different is measured by $K = \log(\alpha_{\max}/\alpha_{\min}) \vee \log(\beta_{\max}/\beta_{\min})$. When this quantity is small compared to the term $c - 2L_0\eta$ (depending on $\pi$, the connectivity part of the parameter) appearing in the left-hand side of (16), the term $K(\|Z_n - z_n\|_0 + \|W_m - w_m\|_0)$ is negligible and the posterior distribution $p^\theta_{n,m}$ will not distinguish between the actual configuration and any equivalent one.
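
To make the difference between the two distances concrete, here is a minimal sketch on a toy example. It assumes that $d$ is the minimum over $(s,t) \in \mathfrak{S}$ of $\|Z_n - s(z_n)\|_0 + \|W_m - t(w_m)\|_0$, which is how the text describes it, and uses a hypothetical symmetry group (simultaneous swap of the two row groups and the two column groups). All function and variable names are illustrative.

```python
import math

def hamming(u, v):
    # l0 distance: number of positions where two label vectors differ.
    return sum(a != b for a, b in zip(u, v))

def relabel(perm, labels):
    # Apply a permutation of group labels (dict: old label -> new label).
    return [perm[g] for g in labels]

def dist_d(Z, W, z, w, sym_pairs):
    # Assumed form of d: min over (s, t) of ||Z - s(z)||_0 + ||W - t(w)||_0.
    return min(hamming(Z, relabel(s, z)) + hamming(W, relabel(t, w))
               for s, t in sym_pairs)

def K_const(alpha, beta):
    # K = log(alpha_max / alpha_min) v log(beta_max / beta_min).
    return max(math.log(max(alpha) / min(alpha)),
               math.log(max(beta) / min(beta)))

# Hypothetical toy case with Q = L = 2, where pi is assumed invariant
# under swapping row and column labels simultaneously.
IDENT = {0: 0, 1: 1}
SWAP = {0: 1, 1: 0}
SYM_PAIRS = [(IDENT, IDENT), (SWAP, SWAP)]

Z, W = [0, 0, 1, 1], [0, 1, 1]
z, w = relabel(SWAP, Z), relabel(SWAP, W)  # a configuration equivalent to (Z, W)

if __name__ == "__main__":
    print("d  =", dist_d(Z, W, z, w, SYM_PAIRS))
    print("l0 =", hamming(Z, z) + hamming(W, w))
    print("K  =", K_const([0.7, 0.3], [0.5, 0.5]))
```

In this toy case the equivalent configuration sits at distance $d = 0$ but at maximal $\ell_0$ distance, so only the $K$ term of (16) can separate it from the actual one.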

Before giving the proof of the theorem, we provide some corollaries that will help understand the importance of the previous result. The first two corollaries deal with special setups and the third one is an attempt to give a general understanding of the behaviour of the groups posterior distribution. All these results state that, under some appropriate condition, the posterior distribution $p^\theta_{n,m}$ concentrates on the actual random configuration $(Z_n, W_m)$, with large probability. We stress the fact that the results are valid for any parameter value $\theta$ (satisfying some additional assumption) and not only the true one $\theta^*$.

Corollary 1. (Case $\mathfrak{S} = \{(\mathrm{Id}, \mathrm{Id})\}$.) Under Assumptions 1 to 6 and when $\mathfrak{S} = \{(\mathrm{Id}, \mathrm{Id})\}$, we obtain that on the set $\Omega_1$ whose $P_*$-probability is at least $1 - \varepsilon_{n,m}$, for any parameter $\theta = (\mu, \pi) \in \Theta$ satisfying $\|\pi - \pi^*\|_\infty \le \eta$, we have

$$p^\theta_{n,m}(Z_n, W_m) \ge 1 - a_{n,m}\exp(a_{n,m}) \quad \text{and} \quad p^\theta_{n,m}(Z_n, W_m) \le (1 + b_{n,m} e^{b_{n,m}})^{-1}, \tag{17}$$

where $a_{n,m} = n e^{-(c - 2L_0\eta)m + K} + m e^{-(c - 2L_0\eta)n + K}$ and $b_{n,m} = n e^{-Cm - K} + m e^{-Cn - K}$ both converge to 0 as $n \to +\infty$. As a consequence, relying on the maximum a posteriori (MAP) procedure, at a parameter value $\hat\theta = (\hat\mu, \hat\pi)$ such that $\hat\pi$ converges to the true parameter value $\pi^*$, namely

$$(\hat Z_n, \hat W_m) := \operatorname*{argmax}_{(z_n, w_m)} p^{\hat\theta}_{n,m}(z_n, w_m), \quad \text{where } \hat\theta = (\hat\mu, \hat\pi) \text{ and } \hat\pi \to \pi^*,$$

the number of misclassified rows and/or columns on the set $\Omega_1$

$$\sum_{i=1}^n 1\{\hat Z_i \ne Z_i\} + \sum_{j=1}^m 1\{\hat W_j \ne W_j\} \ \text{ for LBMs and } \ \sum_{i=1}^n 1\{\hat Z_i \ne Z_i\} \ \text{ for SBMs,}$$

is exactly 0 for large enough $n$.

Corollary 2. (Case of uniform group proportions.) Under Assumptions 1 to 6 and when $K = 0$, we obtain that on the set $\Omega_1$, for any parameter $\theta = (\mu, \pi) \in \Theta$ satisfying $\|\pi - \pi^*\|_\infty \le \eta$, we have

$$p^\theta_{n,m}\big(\{(z_n, w_m) \in \mathcal{U};\ (z_n, w_m) \sim (Z_n, W_m)\}\big) \ge 1 - |\mathfrak{S}|\, a_{n,m} e^{a_{n,m}} \tag{18}$$

$$\text{and} \quad p^\theta_{n,m}\big(\{(z_n, w_m) \in \mathcal{U};\ (z_n, w_m) \sim (Z_n, W_m)\}\big) \le (1 + |\mathfrak{S}|\, b_{n,m} e^{b_{n,m}})^{-1},$$

where $a_{n,m} = n e^{-m(c - 2L_0\eta)} + m e^{-n(c - 2L_0\eta)}$ and $b_{n,m} = n e^{-mC} + m e^{-nC}$ both converge to 0 as $n \to +\infty$. Moreover

$$p^\theta_{n,m}(Z_n, W_m) = \frac{1}{|\mathfrak{S}|}\, p^\theta_{n,m}\big(\{(z_n, w_m) \in \mathcal{U};\ (z_n, w_m) \sim (Z_n, W_m)\}\big). \tag{19}$$

Corollary 3. (General case.) Under Assumptions 1 to 6, we obtain that on the set $\Omega_1$, for any parameter $\theta = (\mu, \pi) \in \Theta$ satisfying $\|\pi - \pi^*\|_\infty \le \eta$, we have

$$p^\theta_{n,m}\big(\{(z_n, w_m) \in \mathcal{U};\ (z_n, w_m) \sim (Z_n, W_m)\}\big) \ge 1 - |\mathfrak{S}|\, a_{n,m} e^{a_{n,m}} \tag{20}$$

$$\text{and} \quad p^\theta_{n,m}\big(\{(z_n, w_m) \in \mathcal{U};\ (z_n, w_m) \sim (Z_n, W_m)\}\big) \le (1 + |\mathfrak{S}|\, b_{n,m} e^{b_{n,m}})^{-1},$$

where $a_{n,m} = n e^{-m(c - 2L_0\eta) + K} + m e^{-n(c - 2L_0\eta) + K}$ and $b_{n,m} = n e^{-mC - K} + m e^{-nC - K}$ both converge to 0 as $n \to +\infty$.

Remark 3. Note that the convergence of the posterior distribution (to the set of configurations equivalent to the actual random one) happens at a rate determined by the constant

$$c - 2L_0\eta > 0.$$

Typically, the rate of this convergence is fast when $\pi$ is not too different from $\pi^*$ (namely $\|\pi - \pi^*\|_\infty$ and thus $L_0\eta$ small) while the connectivity parameters are sufficiently distinct (namely $\kappa_{\min}$ and thus $c$ large).
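
The role of the rate $c - 2L_0\eta$ can be seen by evaluating the general-case bounds numerically. The sketch below computes $a_{n,m}$, $b_{n,m}$ and the resulting lower and upper bounds; the numerical values of $c$, $C$, $L_0$, $\eta$ and $K$ used in the demo are hypothetical and only chosen to respect $c < C/4$ and $\eta < c/(2L_0)$.

```python
import math

def posterior_bounds(n, m, c, L0, eta, C, K, card_S=1):
    # a_{n,m} and b_{n,m} as in the general-case corollary.
    c1 = c - 2 * L0 * eta
    a = n * math.exp(-m * c1 + K) + m * math.exp(-n * c1 + K)
    b = n * math.exp(-m * C - K) + m * math.exp(-n * C - K)
    lower = 1 - card_S * a * math.exp(a)          # lower bound on the class probability
    upper = 1 / (1 + card_S * b * math.exp(b))    # upper bound on the class probability
    return lower, upper

if __name__ == "__main__":
    # Hypothetical constants, for illustration only.
    for n in (200, 500, 1000):
        print(n, posterior_bounds(n, n, c=0.05, L0=1.0, eta=0.01, C=2.0, K=0.5))
```

For small $n$ the lower bound is negative, hence vacuous; it becomes informative only once $m(c - 2L_0\eta)$ dominates $\log n + K$.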

When $\mathfrak{S} = \{(\mathrm{Id}, \mathrm{Id})\}$, the actual configuration has no other equivalent one and the posterior distribution converges to it. When $K = 0$, group proportions are equal and do not discriminate between equivalent configurations. Therefore, all equivalent configurations (if any) are equally likely. When $\mathfrak{S} \ne \{(\mathrm{Id}, \mathrm{Id})\}$ and $K > 0$, the support of the posterior distribution converges to the set of configurations equivalent to the actual one, including the actual one. However, the latter may not be the most likely among those. Provided $n$ and $m$ are large enough, the most likely configuration is the configuration $(z_n, w_m)$ equivalent to $(Z_n, W_m)$ which maximizes the quantity

$$\sum_{i=1}^n \log \alpha_{z_i} + \sum_{j=1}^m \log \beta_{w_j} = \sum_{q=1}^Q N_q(z_n) \log \alpha_q + \sum_{l=1}^L N_l(w_m) \log \beta_l.$$

Also note that we control the number of errors made by a maximum a posteriori clustering procedure only in the case where $\mathfrak{S} = \{(\mathrm{Id}, \mathrm{Id})\}$, namely when there are no symmetries in the set of matrices $\Pi_{QL}$. In the other cases, this procedure is likely to select a configuration equivalent to the true one, but not equal to it. We stress again the fact that the equivalence relation is different from the label switching issue that cannot be avoided in finite mixture models.
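
The selection rule among equivalent configurations described in this remark can be sketched as follows. The symmetry group, group proportions and configurations are hypothetical toy values, and `most_likely_equivalent` is an illustrative helper, not a procedure from the paper.

```python
import math
from collections import Counter

def log_prior_score(z, w, alpha, beta):
    # sum_q N_q(z) log(alpha_q) + sum_l N_l(w) log(beta_l)
    Nq, Nl = Counter(z), Counter(w)
    return (sum(cnt * math.log(alpha[q]) for q, cnt in Nq.items())
            + sum(cnt * math.log(beta[l]) for l, cnt in Nl.items()))

def most_likely_equivalent(Z, W, alpha, beta, sym_pairs):
    # Enumerate the configurations (s(Z), t(W)) and keep the best-scoring one.
    candidates = [([s[g] for g in Z], [t[g] for g in W]) for s, t in sym_pairs]
    return max(candidates, key=lambda c: log_prior_score(c[0], c[1], alpha, beta))

# Hypothetical example: Q = L = 2, pi assumed invariant under a joint label swap.
IDENT = {0: 0, 1: 1}
SWAP = {0: 1, 1: 0}
SYM_PAIRS = [(IDENT, IDENT), (SWAP, SWAP)]
ALPHA, BETA = [0.7, 0.3], [0.5, 0.5]
Z, W = [1, 1, 1, 0], [0, 1]   # actual configuration: three rows in the rarer group

if __name__ == "__main__":
    print(most_likely_equivalent(Z, W, ALPHA, BETA, SYM_PAIRS))
```

Here the actual configuration puts most rows in the less probable group, so the swapped (equivalent) configuration gets the higher score, which is exactly the situation where the actual configuration is not the most likely one in its class.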

Proof of Theorem 1. We shall exhibit the set $\Omega_1$ on which Inequality (16) is satisfied. First note that we have

$$\log \frac{p^\theta_{n,m}(s(Z_n), t(W_m))}{p^\theta_{n,m}(z_n, w_m)} = \delta^\pi(s(Z_n), t(W_m), z_n, w_m) + \sum_{i=1}^n \log\left( \frac{\alpha_{s(Z_i)}}{\alpha_{z_i}} \right) + \sum_{j=1}^m \log\left( \frac{\beta_{t(W_j)}}{\beta_{w_j}} \right).$$

Thus, by letting $K = \log(\alpha_{\max}/\alpha_{\min}) \vee \log(\beta_{\max}/\beta_{\min})$, Inequality (16) is satisfied as soon as we have

$$(c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2) \le \delta^\pi(s(Z_n), t(W_m), z_n, w_m) \le C(m r_1 + n r_2). \tag{21}$$

Note that the latter inequality is defined on the set of equivalent configurations $\mathcal{U}$ and we can thus replace $(s(Z_n), t(W_m))$ by $(Z_n, W_m)$. Let $(z_n^*, w_m^*)$ be a fixed configuration in $\mathcal{U}$, consider $(z_n, w_m) \in \mathcal{U}$. Whenever $(z_n, w_m) \sim (z_n^*, w_m^*)$, we have $r_1 + r_2 = 0$ and the previous inequality is automatically satisfied. Thus, we consider $(z_n, w_m) \in \mathcal{U}$ such that $(z_n, w_m) \not\sim (z_n^*, w_m^*)$ and let $r_1 := \|z_n^* - s(z_n)\|_0$ and $r_2 := \|w_m^* - t(w_m)\|_0$, where $(s,t) \in \mathfrak{S}$ realizes the distance $d((z_n^*, w_m^*), (z_n, w_m))$. We consider the event

$$A(z_n^*, w_m^*, z_n, w_m) = \big\{\delta^\pi(z_n^*, w_m^*, z_n, w_m) < (c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2)\big\} \cup \big\{\delta^\pi(z_n^*, w_m^*, z_n, w_m) > C(m r_1 + n r_2)\big\},$$

where the constants $c, C > 0$ have been previously introduced in Proposition 1 and satisfy $0 < 2c \le C/2$. We also assume that $\pi$ satisfies $c - 2L_0\|\pi - \pi^*\|_\infty > 0$. According to this same proposition, as soon as the configuration $(z_n^*, w_m^*)$ is regular in the sense that it belongs to the set $\mathcal{U}^0$ defined through Equation (3) and following lines, we obtain that on the set $\{(Z_n, W_m) = (z_n^*, w_m^*)\}$, we have

$$2(c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2) \le E_*^{z_n^* w_m^*}\big(\delta^\pi(z_n^*, w_m^*, z_n, w_m)\big) \le \frac{C}{2}(m r_1 + n r_2).$$

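We now control the probability of this event. Conditionally on $\{(Z_n, W_m) = (z_n^*, w_m^*)\}$, the event $A(z_n^*, w_m^*, z_n, w_m)$ is included in the two-sided deviation of $\delta^\pi(z_n^*, w_m^*, z_n, w_m)$ from its conditional expectation $\Delta^\pi(z_n^*, w_m^*, z_n, w_m)$ at a distance at least

$$\min\Big\{(c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2),\ \frac{C}{2}(m r_1 + n r_2)\Big\} = (c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2) \ge (c - 2L_0\eta)(m r_1 + n r_2).$$

In other words,

$$A(z_n^*, w_m^*, z_n, w_m) \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\} \subset \Big( \big\{(\delta^\pi - \Delta^\pi)(z_n^*, w_m^*, z_n, w_m) < -(c - 2L_0\|\pi - \pi^*\|_\infty)(m r_1 + n r_2)\big\} \cup \big\{(\delta^\pi - \Delta^\pi)(z_n^*, w_m^*, z_n, w_m) > \tfrac{C}{2}(m r_1 + n r_2)\big\} \Big)$$

$$\subset \big\{ |(\delta^\pi - \Delta^\pi)(z_n^*, w_m^*, z_n, w_m)| \ge (c - 2L_0\eta)(m r_1 + n r_2) \big\}.$$

Combining these set inclusions with Assumption 3 yields

$$P_*\big(A(z_n^*, w_m^*, z_n, w_m) \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\}\big) \le P_*\big((Z_n, W_m) = (z_n^*, w_m^*)\big) \times P_*^{z_n^* w_m^*}\big( |(\delta^\pi - \Delta^\pi)(z_n^*, w_m^*, z_n, w_m)| \ge (c - 2L_0\eta)(m r_1 + n r_2) \big)$$

$$\le 2\exp[-\psi^*(c - 2L_0\eta)(m r_1 + n r_2)]\, \mu(z_n^*, w_m^*). \tag{22}$$

We now consider the set $\Omega_1$ defined by

$$\Omega_1 = \Omega_0 \cap \Big( \bigcap_{(z_n, w_m) \in \mathcal{U}} \overline{A}(Z_n, W_m, z_n, w_m) \Big) = \bigcup_{(z_n^*, w_m^*) \in \mathcal{U}^0}\ \bigcap_{(z_n, w_m) \in \mathcal{U}} \Big( \overline{A}(z_n^*, w_m^*, z_n, w_m) \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\} \Big). \tag{23}$$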

On the set $\Omega_1$, Inequality (21) and thus Inequality (16) are both satisfied. We let

$$\mathcal{U}^{z_n^*, w_m^*} := \mathcal{U} \setminus \{(z_n^*, w_m^*)\} = \mathcal{U} \setminus \{(s(z_n^*), t(w_m^*));\ (s,t) \in \mathfrak{S}\},$$

be the set of all configurations but those which are equivalent to $(z_n^*, w_m^*)$. Since for any $(s,t) \in \mathfrak{S}$, the event $A(z_n^*, w_m^*, s(z_n^*), t(w_m^*))$ has $P_*$-probability zero, we may write

$$\overline{\Omega}_1 = \overline{\Omega}_0 \cup \Big( \bigcup_{(z_n^*, w_m^*) \in \mathcal{U}^0}\ \bigcup_{(z_n, w_m) \in \mathcal{U}^{z_n^*, w_m^*}} A(z_n^*, w_m^*, z_n, w_m) \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\} \Big).$$

We now partition the set of configurations $(z_n, w_m) \in \mathcal{U}^{z_n^*, w_m^*}$ according to the distance of each point $(z_n, w_m)$ to $(z_n^*, w_m^*)$. We write the following disjoint union

$$\mathcal{U}^{z_n^*, w_m^*} := \bigcup_{r_1 + r_2 = 1}^{n+m} \mathcal{U}^{z_n^*, w_m^*}(r_1, r_2) := \bigcup_{r_1 + r_2 = 1}^{n+m} \big\{ (z_n, w_m) \in \mathcal{U}^{z_n^*, w_m^*};\ d((z_n^*, w_m^*); (z_n, w_m)) = \|z_n^* - s(z_n)\|_0 + \|w_m^* - t(w_m)\|_0 \ \text{and}\ \|z_n^* - s(z_n)\|_0 = r_1,\ \|w_m^* - t(w_m)\|_0 = r_2 \big\}. \tag{24}$$

Note that the above decomposition is not unique. Indeed, we may have that the distance $d((z_n^*, w_m^*); (z_n, w_m)) = r_1 + r_2 = r_1' + r_2'$ but $r_1 \ne r_1'$ and $r_2 \ne r_2'$. In such a case, we make an arbitrary choice between the couples $(r_1, r_2)$ and $(r_1', r_2')$ to represent the distance from $(z_n, w_m)$ to $(z_n^*, w_m^*)$. This decomposition leads to

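$$P_*(\overline{\Omega}_1) \le P_*(\overline{\Omega}_0) + 2 \sum_{(z_n^*, w_m^*) \in \mathcal{U}^0} \mu(z_n^*, w_m^*) \times \sum_{r_1 + r_2 = 1}^{n+m} |\mathcal{U}^{z_n^*, w_m^*}(r_1, r_2)| \exp[-\psi^*(c - 2L_0\eta)(m r_1 + n r_2)].$$

Now, we use the bound

$$|\mathcal{U}^{z_n^*, w_m^*}(r_1, r_2)| \le |\mathfrak{S}| \binom{n}{r_1} \binom{m}{r_2}, \tag{25}$$

which leads to

$$P_*(\overline{\Omega}_1) \le P_*(\overline{\Omega}_0) + 2|\mathfrak{S}| \sum_{r_1 + r_2 = 1}^{n+m} \binom{n}{r_1} \binom{m}{r_2} \exp[-\psi^*(c - 2L_0\eta)(m r_1 + n r_2)]$$

$$\le P_*(\overline{\Omega}_0) + 2|\mathfrak{S}| \Big[ \{1 + \exp[-m\psi^*(c - 2L_0\eta)]\}^n \{1 + \exp[-n\psi^*(c - 2L_0\eta)]\}^m - 1 \Big].$$

We now rely on the following bound, valid for any $u, v > 0$,

$$(1 + u)^n \times (1 + v)^m - 1 \le (nu + mv)\exp(nu + mv). \tag{26}$$

Combining the latter with the control of the probability of $\Omega_0$ given in Proposition 1, we obtain

$$P_*(\overline{\Omega}_1) \le 2QL\exp(-(n \wedge m)\mu_{\min}^2/2) + 2|\mathfrak{S}|\, d_{n,m}\exp(d_{n,m}),$$

where $d_{n,m} = n\exp\{-\psi^*(c - 2L_0\eta)m\} + m\exp\{-\psi^*(c - 2L_0\eta)n\}$.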

Note that as soon as $(m_n)_{n \ge 1}$ is a sequence such that $m_n \to +\infty$ and $(\log n)/m_n \to 0$, we obtain that for any constant $a > 0$, the sequence $u_n = n\exp(-a m_n)$ is negligible with respect to $n^{-1-s}$, for any $s > 0$, and thus $\sum_n u_n < +\infty$. In particular, the sequence

$$\varepsilon_{n,m} := 2QL\exp[-(n \wedge m)\mu_{\min}^2/2] + 2|\mathfrak{S}|\, d_{n,m}\exp(d_{n,m})$$

satisfies $\sum_n \varepsilon_{n,m_n} < +\infty$. This concludes the proof. □

Proof of Corollaries 1, 2 and 3. The proof of these three corollaries relies on the same scheme that we shall now present. First note that $\Omega_1 = \cup_{(z_n^*, w_m^*) \in \mathcal{U}^0} (\Omega_1 \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\})$. Let us fix some configuration $(z_n^*, w_m^*)$ in $\mathcal{U}^0$. On the set $\Omega_1 \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\}$, we have

$$1 - p^\theta_{n,m}(\{(Z_n, W_m)\}) \le \frac{1 - p^\theta_{n,m}(\{(Z_n, W_m)\})}{p^\theta_{n,m}(\{(Z_n, W_m)\})} = \sum_{\substack{(z_n, w_m) \in \mathcal{U} \\ (z_n, w_m) \not\sim (z_n^*, w_m^*)}} \exp\left( -\log \frac{p^\theta_{n,m}(\{(z_n^*, w_m^*)\})}{p^\theta_{n,m}(z_n, w_m)} \right),$$

where we abbreviate to $\{(Z_n, W_m)\}$ and $\{(z_n^*, w_m^*)\}$ the whole sets of configurations $\{(z_n, w_m) \sim (Z_n, W_m)\}$ and $\{(z_n, w_m) \sim (z_n^*, w_m^*)\}$, respectively. Let $(z_n, w_m) \not\sim (z_n^*, w_m^*)$. There exists $(s,t) \in \mathfrak{S}$ such that $\|z_n - s(z_n^*)\|_0 = r_1$ and $\|w_m - t(w_m^*)\|_0 = r_2$. Using the left-hand side of Inequality (16) and $\|\pi - \pi^*\|_\infty \le \eta$, we get

$$\log \frac{p^\theta_{n,m}(\{(z_n^*, w_m^*)\})}{p^\theta_{n,m}(z_n, w_m)} \ge \log \frac{p^\theta_{n,m}(s(z_n^*), t(w_m^*))}{p^\theta_{n,m}(z_n, w_m)} \ge (c - 2L_0\eta)(m r_1 + n r_2) - K(r_1 + r_2)$$

and therefore

$$1 - p^\theta_{n,m}(\{(Z_n, W_m)\}) \le \sum_{\substack{(z_n, w_m) \in \mathcal{U} \\ (z_n, w_m) \not\sim (z_n^*, w_m^*)}} \exp[-(c - 2L_0\eta)(m r_1 + n r_2) + K(r_1 + r_2)]. \tag{27}$$

When $\mathfrak{S} = \{(\mathrm{Id}, \mathrm{Id})\}$, the set $\{(z_n, w_m) \sim (Z_n, W_m)\}$ reduces to a singleton and the previous bound becomes

$$1 - p^\theta_{n,m}(Z_n, W_m) \le \sum_{\substack{(z_n, w_m) \in \mathcal{U} \\ (z_n, w_m) \ne (z_n^*, w_m^*)}} \exp[-(c - 2L_0\eta)(m r_1 + n r_2) + K(r_1 + r_2)].$$

Using the decomposition (24) on the set $\mathcal{U}^{z_n^*, w_m^*}$ and the bound (25) on the cardinality of each $\mathcal{U}^{z_n^*, w_m^*}(r_1, r_2)$, we get

$$1 - p^\theta_{n,m}(Z_n, W_m) \le \sum_{r_1 + r_2 = 1}^{n+m} \binom{n}{r_1} \binom{m}{r_2} \exp[-(c - 2L_0\eta)(m r_1 + n r_2) + K(r_1 + r_2)] = \{(1 + \exp(-m c_1 + K))^n (1 + \exp(-n c_1 + K))^m - 1\},$$

where $c_1 = c - 2L_0\eta$. Using again Inequality (26), we obtain

$$1 - p^\theta_{n,m}(Z_n, W_m) \le a_{n,m}\exp(a_{n,m}),$$

where $a_{n,m} = n e^{-(c - 2L_0\eta)m + K} + m e^{-(c - 2L_0\eta)n + K}$.

The case where $K = 0$ is handled similarly and gives

$$1 - p^\theta_{n,m}(\{(Z_n, W_m)\}) \le |\mathfrak{S}|\, a_{n,m}\exp(a_{n,m}),$$

where $a_{n,m} = n e^{-(c - 2L_0\eta)m} + m e^{-(c - 2L_0\eta)n}$. Moreover when $K = 0$, we have $\alpha_1 = \dots = \alpha_Q$ and $\beta_1 = \dots = \beta_L$ and it is easy to check that

$$p^\theta_{n,m}(Z_n, W_m) = p^\theta_{n,m}(s(Z_n), t(W_m))$$

for all $(s,t) \in \mathfrak{S}$.

Now, in the general case, we come back to (27). Using the decomposition (24) on the set $\mathcal{U}^{z_n^*, w_m^*}$ and the bound (25) on the cardinality of each $\mathcal{U}^{z_n^*, w_m^*}(r_1, r_2)$, we get

$$1 - p^\theta_{n,m}(\{(Z_n, W_m)\}) \le \sum_{r_1 + r_2 = 1}^{n+m} |\mathfrak{S}| \binom{n}{r_1} \binom{m}{r_2} \exp[-(c - 2L_0\eta)(m r_1 + n r_2) + K(r_1 + r_2)] \le |\mathfrak{S}| \{(1 + \exp(-m c_1 + K))^n (1 + \exp(-n c_1 + K))^m - 1\},$$

where $c_1 = c - 2L_0\eta$. Using again Inequality (26), we obtain

$$1 - p^\theta_{n,m}(\{(Z_n, W_m)\}) \le |\mathfrak{S}|\, a_{n,m}\exp(a_{n,m}),$$

where $a_{n,m} = n\exp(-m c_1 + K) + m\exp(-n c_1 + K)$.

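We now provide an upper bound for the posterior probability of the class $\{(z_n, w_m) \sim (Z_n, W_m)\}$, valid on the set $\Omega_1$. Let us fix some configuration $(z_n^*, w_m^*)$ in $\mathcal{U}^0$. On the set $\Omega_1 \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\}$, we have

$$\frac{1}{p^\theta_{n,m}(\{(Z_n, W_m)\})} = 1 + \sum_{(z_n, w_m) \not\sim (z_n^*, w_m^*)} \exp\left( -\log \frac{p^\theta_{n,m}(\{(Z_n, W_m)\})}{p^\theta_{n,m}(z_n, w_m)} \right)$$

and relying on the right-hand side of Inequality (16), we get

$$p^\theta_{n,m}(\{(Z_n, W_m)\}) \le \left\{ 1 + \sum_{(z_n, w_m) \not\sim (z_n^*, w_m^*)} \exp\left( -\log \frac{p^\theta_{n,m}(\{(Z_n, W_m)\})}{p^\theta_{n,m}(z_n, w_m)} \right) \right\}^{-1} \le \left\{ 1 + \sum_{(z_n, w_m) \not\sim (z_n^*, w_m^*)} \exp\big( -C(m r_1 + n r_2) - K(r_1 + r_2) \big) \right\}^{-1}.$$

Following the same lines, we obtain the desired upper bounds. □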



4 Examples of application 

The goal of this section is to derive the results of Theorem 1 and the following corollaries in many different setups. The key ingredient for that lies in establishing the concentration of the ratio $\delta^\pi$ around its conditional expectation $\Delta^\pi$ (namely Assumption 3). The general scheme of proof is first presented; different setups are then explicitly explored.



4.1 Scheme of proof of concentration inequalities 

One of the main issues for Theorem 1 to be valid is the existence of a concentration of the ratio $\delta^\pi$ around its conditional expectation $\Delta^\pi$, namely Assumption 3. This section presents the general methodology that will be employed.

The scheme of proof is as follows. Relying on the notation of Assumption 3 and using (7), we write

$$\delta^\pi(z_n^*, w_m^*, z_n, w_m) - \Delta^\pi(z_n^*, w_m^*, z_n, w_m) = \sum_{(i,j)} \left\{ \log\left( \frac{f(X_{ij};\pi_{z_i^* w_j^*})}{f(X_{ij};\pi_{z_i w_j})} \right) - E_*^{z_n^* w_m^*} \log\left( \frac{f(X_{ij};\pi_{z_i^* w_j^*})}{f(X_{ij};\pi_{z_i w_j})} \right) \right\} := \sum_{(i,j)} Y_{ij}.$$

Conditional on $(Z_n, W_m) = (z_n^*, w_m^*)$, the random variables $Y_{ij}$ are independent and centered. There are exactly $D := \mathrm{diff}(z_n^*, w_m^*, z_n, w_m)$ such non null variables and since $D \le m r_1 + n r_2 - r_1 r_2 \le m r_1 + n r_2$, we may write

$$P_*^{z_n^* w_m^*}\big( |(\delta^\pi - \Delta^\pi)(z_n^*, w_m^*, z_n, w_m)| \ge \varepsilon(m r_1 + n r_2) \big) \le P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big). \tag{28}$$

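Thus, the problem boils down to establishing a concentration inequality for the sum $\sum_{(i,j)} Y_{ij}$ composed of $D$ conditionally independent and centered random variables. As soon as we have the existence of a positive function $\psi^*_{\max}$ such that for any $\varepsilon > 0$,

$$P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big) \le 2\exp\{-\psi^*_{\max}(\varepsilon) D\}, \tag{29}$$

we can combine Lemma 2 and bound (28) to obtain

$$P_*^{z_n^* w_m^*}\big( |(\delta^\pi - \Delta^\pi)(z_n^*, w_m^*, z_n, w_m)| \ge \varepsilon(m r_1 + n r_2) \big) \le 2\exp\big\{ -\psi^*_{\max}(\varepsilon)\,\mu_{\min}^2 (m r_1 + n r_2)/8 \big\} := 2\exp\big\{ -\psi^*(\varepsilon)(m r_1 + n r_2) \big\},$$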

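with $\psi^*(\cdot) = \psi^*_{\max}(\cdot)\,\mu_{\min}^2/8$. Note that Inequality (29) is often obtained through a Cramér-Chernoff bound in the following way. We let $\psi_{ij}(\lambda) := \log E_*^{z_n^* w_m^*}(\exp(\lambda Y_{ij}))$, for any $\lambda > 0$ such that this quantity is finite, let us say $\lambda \in I \subset \mathbb{R}$. Using a Cramér-Chernoff bound, we get for any $x > 0$,

$$P_*^{z_n^* w_m^*}(|Y_{ij}| \ge x) \le 2\exp\Big\{ -\sup_{\lambda \in I}(\lambda x - \psi_{ij}(\lambda)) \Big\}.$$

As soon as we can uniformly bound this quantity, namely if we can write

$$P_*^{z_n^* w_m^*}(|Y_{ij}| \ge x) \le 2\exp\Big\{ -\sup_{\lambda \in I}(\lambda x - \psi_{\max}(\lambda)) \Big\},$$

with $\psi_{\max} := \max_{(i,j) \in \mathcal{I}} \psi_{ij}$, the conditional independence of the $Y_{ij}$'s gives that for any $\varepsilon > 0$ and any $\lambda \in I$,

$$P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big) \le 2\exp\{ -D(\lambda\varepsilon - \psi_{\max}(\lambda)) \},$$

leading to

$$P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big) \le 2\exp\Big\{ -D \sup_{\lambda \in I}(\lambda\varepsilon - \psi_{\max}(\lambda)) \Big\} \le 2\exp\{ -D\,\psi^*_{\max}(\varepsilon) \},$$

where $\psi^*_{\max}(\varepsilon) := \sup_{\lambda \in I}(\lambda\varepsilon - \psi_{\max}(\lambda))$. Note that since $\psi_{ij}(0) = 0$, we have $\psi_{\max}(0) = 0$ and $\psi^*_{\max}$ is non negative.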
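
The Cramér-Chernoff scheme above can be illustrated numerically on a simple bounded, centered variable. The sketch below is only an illustration under stated assumptions: it takes $Y = \mathrm{amp}\cdot(X - p)$ with $X$ Bernoulli($p$) as a hypothetical stand-in for $Y_{ij}$, approximates the Legendre transform on a finite grid of $\lambda$ values (so the supremum is approximated from below), and only reproduces the form of the bound.

```python
import math

def log_mgf(lam, p=0.25, amp=1.0):
    # log E exp(lam * Y) for Y = amp * (X - p), X ~ Bernoulli(p): centered and bounded.
    return math.log(p * math.exp(lam * amp * (1 - p))
                    + (1 - p) * math.exp(-lam * amp * p))

def psi_star(eps, lam_grid, p=0.25, amp=1.0):
    # Grid approximation (from below) of sup_lambda (lambda * eps - log_mgf(lambda)).
    return max(lam * eps - log_mgf(lam, p, amp) for lam in lam_grid)

def chernoff_sum_bound(D, eps, lam_grid, p=0.25, amp=1.0):
    # Form of the bound for the sum of D independent copies: 2 * exp(-D * psi_star(eps)).
    return 2 * math.exp(-D * psi_star(eps, lam_grid, p, amp))

LAM_GRID = [k / 100 for k in range(0, 1001)]  # lambda in [0, 10]

if __name__ == "__main__":
    for D in (10, 100, 1000):
        print(D, chernoff_sum_bound(D, 0.1, LAM_GRID))
```

Because the grid contains $\lambda = 0$, the approximated transform is non negative, in line with the remark that $\psi^*_{\max} \ge 0$; by Hoeffding's lemma it is also at least $2\varepsilon^2/\mathrm{amp}^2$ for this bounded variable.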

4.2 Binary observations 

In this section, we assume that $X_{ij} \in \{0,1\}$ and $f(\cdot;\pi)$ is a Bernoulli distribution with parameter $\pi$. In this case, point i) in Assumption 1 is automatically satisfied.

We first state a condition on the parameters $\pi \in \Pi$ that ensures that both Assumptions 3 and 4 are satisfied. Note that this constraint, although rather general, does not cover the cases where some probabilities $\pi_{ql}$ may either be 0 or 1.

Assumption 7. The parameter set $\Pi$ is included in $[a, 1-a]$ for some $a \in (0, 1/2)$.

Lemma 3. Under Assumption 7, we obtain that Assumption 3 is satisfied with $\psi^*(x) = x^2 \mu_{\min}^2 / [16(\log(1-a) - \log a)^2]$.

Proof. According to Equation (5) and Assumption 7 ensuring $\pi_{ql} \ne 0, 1$, the log ratio of the posterior probabilities $\delta^\pi$ as well as its conditional expectation $\Delta^\pi$ are always finite. Here, we have

$$Y_{ij} = X_{ij}\log\left( \frac{\pi_{z_i^* w_j^*}}{\pi_{z_i w_j}} \right) + (1 - X_{ij})\log\left( \frac{1 - \pi_{z_i^* w_j^*}}{1 - \pi_{z_i w_j}} \right) + c,$$

where $c$ is a centering constant. Conditional on $(Z_n, W_m) = (z_n^*, w_m^*)$, the random variables $Y_{ij}$ are independent, centered and bounded by $2[\log(1-a) - \log a]$. A simple Hoeffding's Inequality then yields (29) with $\psi^*_{\max}(x) = x^2/[2\{\log(1-a) - \log a\}^2]$. This gives the desired result. □

Corollary 4. Consider the model defined by (1) where $f(\cdot;\pi)$ is a Bernoulli distribution with parameter $\pi \in \Pi$. Under ii) of Assumption 1 and Assumptions 2, 6 and 7, the conclusions of Theorem 1 and Corollaries 1 to 3 are valid.

Proof. According to Lemma 3, it suffices to prove that Assumptions 4 and 5 are valid. But as we assume $\pi_{ql} \ne 0, 1$, the Bernoulli distributions are supported exactly on $\{0, 1\}$ and the requirement $\kappa_{\max} < +\infty$ is satisfied. Moreover, as $\pi_{ql} \in [a, 1-a]$, Assumption 5 is satisfied with $L_0 = 1/a$. □
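
As a small sanity check of the constants appearing in the binary case, the sketch below evaluates the bound $M = \log(1-a) - \log a$ on the Bernoulli log-likelihood ratio and the two rate functions $\psi^*_{\max}$ and $\psi^*$. The values of $a$ and $\mu_{\min}$ used in the demo are hypothetical, and the helper names are illustrative.

```python
import math

def log_ratio_bound(a):
    # M = log(1 - a) - log(a): bounds |log(p/q)| and |log((1-p)/(1-q))| for p, q in [a, 1-a].
    return math.log(1 - a) - math.log(a)

def psi_max_star(x, a):
    # Hoeffding exponent used for (29) in the binary case: x^2 / (2 M^2).
    return x ** 2 / (2 * log_ratio_bound(a) ** 2)

def psi_star(x, a, mu_min):
    # Rate in the binary case: x^2 mu_min^2 / (16 M^2).
    return x ** 2 * mu_min ** 2 / (16 * log_ratio_bound(a) ** 2)

def y_uncentered(x_obs, p_star_cfg, p_cfg):
    # Log-likelihood ratio of one Bernoulli entry, before centering.
    return (x_obs * math.log(p_star_cfg / p_cfg)
            + (1 - x_obs) * math.log((1 - p_star_cfg) / (1 - p_cfg)))

if __name__ == "__main__":
    a, mu_min = 0.1, 0.2  # hypothetical values
    print(log_ratio_bound(a), psi_max_star(0.05, a), psi_star(0.05, a, mu_min))
```

The uncentered ratio takes values in $[-M, M]$, so the centered variable is bounded by $2M$ in absolute value, which is the bound fed into Hoeffding's inequality.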

4.3 Binomial observations 

In this section, we assume that $X_{ij} \in \{0, \dots, p\}$ and $f(\cdot;\pi)$ is a Binomial distribution $\mathcal{B}(p, \pi)$. In this case, point i) in Assumption 1 is satisfied. We shall also make Assumption 7 so that Assumptions 3 and 4 are also satisfied.

Lemma 4. Under Assumption 7, we obtain that Assumption 3 is satisfied with $\psi^*(x) = x^2 \mu_{\min}^2 / [16 p^2 (\log(1-a) - \log a)^2]$.






M. Mariadassou and C. Matias 



Proof. According to Equation (5) and Assumption 7 ensuring $\pi_{ql} \ne 0, 1$, the log ratio of the posterior probabilities $\delta^\pi$ as well as its conditional expectation $\Delta^\pi$ are always finite. Here, we have

$$Y_{ij} = \sum_{k=0}^{p} 1\{X_{ij} = k\} \left\{ k \log\left( \frac{\pi_{z_i^* w_j^*}}{\pi_{z_i w_j}} \right) + (p - k)\log\left( \frac{1 - \pi_{z_i^* w_j^*}}{1 - \pi_{z_i w_j}} \right) \right\} + c,$$

where $c$ is a centering constant. Now, the $Y_{ij}$'s are bounded by $2p\{\log(1-a) - \log a\}$ and the same proof as in Lemma 3 applies. □

The proof of the following corollary follows the same lines as for Corollary 4 and is omitted.

Corollary 5. Consider the model defined by (1) where $f(\cdot;\pi)$ is a Binomial distribution $\mathcal{B}(p, \pi)$ with parameter $\pi \in \Pi$. Under ii) of Assumption 1 and Assumptions 2, 6 and 7, the conclusions of Theorem 1 and Corollaries 1 to 3 are valid.



4.4 Discrete observations 

In this section, we assume that $X_{ij} \in \{1, \dots, p\}$ and $f(\cdot;\pi)$ is a discrete distribution with parameter $\pi = (\pi(1), \dots, \pi(p))$ and $f(k;\pi) = \pi(k)$ for any $1 \le k \le p$. In this case, point i) in Assumption 1 is automatically satisfied. We state a condition on the parameters $\pi \in \Pi$ that ensures that both Assumptions 3 and 4 are also satisfied.

Assumption 8. The parameter set $\Pi$ is included in $[a, 1-a]^p$ for some $a \in (0, 1/2)$.

Lemma 5. Under Assumption 8, we obtain that Assumption 3 is satisfied with $\psi^*(x) = x^2 \mu_{\min}^2 / \{8p[\log(1-a) - \log a]^2\}$.

Proof. According to Equation (5) and Assumption 8 ensuring $\pi_{ql}(k) \ne 0, 1$, the log ratio of the posterior probabilities $\delta^\pi$ as well as its conditional expectation $\Delta^\pi$ are always finite. Here, we have

$$Y_{ij} = \sum_{k=1}^{p} 1\{X_{ij} = k\} \log\left( \frac{\pi_{z_i^* w_j^*}(k)}{\pi_{z_i w_j}(k)} \right) + c,$$

where $c$ is a centering constant. Conditional on $(Z_n, W_m) = (z_n^*, w_m^*)$, the random variables $Y_{ij}$ are independent, centered and bounded by $p[\log(1-a) - \log a]$. A simple Hoeffding's Inequality then yields (29) with $\psi^*_{\max}(x) = x^2/\{p[\log(1-a) - \log a]^2\}$. This gives the desired result. □

Corollary 6. Consider the model defined by (1) where $f(\cdot;\pi)$ is a discrete distribution with parameter $\pi \in \Pi$. Under ii) of Assumption 1 and Assumptions 2, 6 and 8, the conclusions of Theorem 1 and Corollaries 1 to 3 are valid.

Proof. According to Lemma 5, it suffices to prove that Assumptions 4 and 5 are valid. But as we assume $\pi_{ql}(k) \ne 0, 1$ for all $k$, the discrete distributions are supported exactly on $\{1, \dots, p\}$ and the requirement $\kappa_{\max} < +\infty$ is satisfied. Moreover, Assumption 5 is satisfied with $L_0 = 1/a$. □

4.5 Poisson observations

In this section, we assume that $X_{ij} \in \mathbb{N}$ and $f(\cdot;\pi)$ is a Poisson distribution with parameter $\pi \in \Pi$. In this case, point i) in Assumption 1 is automatically satisfied. We state a condition on the parameter $\pi \in \Pi$ that ensures that both Assumptions 3 and 4 are also satisfied.

Assumption 9. The parameter set $\Pi$ is included in $[\pi_{\min}, \pi_{\max}] \subset (0, +\infty)$.

Lemma 6. Under Assumption 9, we obtain that Assumption 3 is satisfied with $\psi^*(x) = \mu_{\min}^2\, \pi_{\max}\, h\big(x/(\pi_{\max}\log(\pi_{\max}/\pi_{\min}))\big)/8$ and $h(u) = (1+u)\log(1+u) - u$, for all $u \ge -1$.

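Proof. According to Equation (5) and Assumption 9 ensuring $\pi_{ql} > 0$, the log ratio of the posterior probabilities $\delta^\pi$ as well as its conditional expectation $\Delta^\pi$ are always finite. Here, we have

$$Y_{ij} = \log\left( \frac{\pi_{z_i^* w_j^*}}{\pi_{z_i w_j}} \right) (X_{ij} - \pi^*_{z_i^* w_j^*}).$$

Conditional on $(Z_n, W_m) = (z_n^*, w_m^*)$, the random variables $Y_{ij}$ are independent, centered and up to a scale factor, these are Poisson random variables. We let

$$h(u) = (1+u)\log(1+u) - u, \quad \forall u \ge -1$$

and write for any $x > 0$, a Cramér-Chernoff bound for a Poisson variable

$$P_*^{z_n^* w_m^*}(|Y_{ij}| \ge x) \le P_*^{z_n^* w_m^*}\left( |X_{ij} - \pi^*_{z_i^* w_j^*}| \ge \frac{x}{\log(\pi_{\max}/\pi_{\min})} \right) \le 2\exp\left\{ -\pi^*_{z_i^* w_j^*}\, h\left( \frac{x}{\pi^*_{z_i^* w_j^*}\log(\pi_{\max}/\pi_{\min})} \right) \right\}$$

(see for instance Massart, 2007). Since for any $u > 0$, the function $\pi \mapsto -\pi h(u/\pi)$ is increasing on $(0, +\infty)$, we obtain

$$P_*^{z_n^* w_m^*}(|Y_{ij}| \ge x) \le 2\exp\left\{ -\pi_{\max}\, h\left( \frac{x}{\pi_{\max}\log(\pi_{\max}/\pi_{\min})} \right) \right\}.$$

Let $D = \mathrm{diff}(z_n, w_m, z_n', w_m')$. The conditional independence of the $Y_{ij}$'s combined with the previous Cramér-Chernoff bound yields

$$P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big) \le 2\exp\left\{ -D\,\pi_{\max}\, h\left( \frac{\varepsilon}{\pi_{\max}\log(\pi_{\max}/\pi_{\min})} \right) \right\},$$

which concludes the proof. □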
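
The two ingredients of this proof, the Poisson Cramér-Chernoff bound with the function $h$ and the monotonicity of $\pi \mapsto -\pi h(u/\pi)$, can be checked numerically. The sketch below compares the exact two-sided Poisson tail (summed up to a truncation point `kmax`, an implementation choice) with $2\exp\{-\lambda h(x/\lambda)\}$; the parameter values in the demo are arbitrary.

```python
import math

def h(u):
    # h(u) = (1 + u) log(1 + u) - u, for u > -1.
    return (1 + u) * math.log1p(u) - u

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam), computed on the log scale for stability.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def two_sided_tail(lam, x, kmax=400):
    # P(|X - lam| >= x), with the sum truncated at kmax.
    return sum(poisson_pmf(k, lam) for k in range(kmax + 1) if abs(k - lam) >= x)

def chernoff_bound(lam, x):
    # Cramér-Chernoff bound 2 * exp(-lam * h(x / lam)).
    return 2 * math.exp(-lam * h(x / lam))

def scaled_rate(lam, u):
    # The map lam -> -lam * h(u / lam), used to replace lam by pi_max in the bound.
    return -lam * h(u / lam)

if __name__ == "__main__":
    for lam, x in ((0.5, 1), (3, 4), (10, 8)):
        print(lam, x, two_sided_tail(lam, x), chernoff_bound(lam, x))
```

Since the map `scaled_rate` increases with `lam`, replacing the true intensity by $\pi_{\max}$ only loosens the bound, which is what makes it uniform over the parameter set.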

Corollary 7. Consider the model defined by (1) where $f(\cdot;\pi)$ is a Poisson distribution with parameter $\pi \in \Pi$. Under ii) of Assumption 1 and Assumptions 2, 6 and 9, the conclusions of Theorem 1 and Corollaries 1 to 3 are valid.

Proof. According to Lemma 6, it suffices to prove that Assumptions 4 and 5 are valid. But as we assume $\pi_{ql} > 0$, the Poisson distributions are supported exactly on $\mathbb{N}$ and the requirement $\kappa_{\max} < +\infty$ is satisfied. Moreover, Assumption 5 is satisfied with $L_0 = \pi_{\max}/\pi_{\min}$. □

4.6 Gaussian location model

In this section, we are interested in Gaussian observations in the homoscedastic case. We thus assume that $X_{ij} \in \mathbb{R}$ and $f(\cdot;\pi)$ is a Gaussian distribution with mean value $\pi$ and fixed variance $\sigma^2$. Namely, we have $f(x;\pi_{ij}) = c\exp\{-(x - \pi_{ij})^2/(2\sigma^2)\}$, where $c$ is a normalizing constant. Note that point i) in Assumption 1 is satisfied. We also require bounded values for $\pi \in \Pi$ for concentration inequalities to be uniformly satisfied, namely a variant of Assumption 9 as we do not impose positivity on the parameters.

Assumption 10. The parameter set $\Pi$ is included in $[\pi_{\min}, \pi_{\max}] \subset \mathbb{R}$.

Lemma 7. Under Assumption 10, we obtain that Assumption 3 is satisfied with $\psi^*(x) = x^2 \sigma^2 \mu_{\min}^2 / (16(\pi_{\max} - \pi_{\min})^2)$.

Proof. Since the distributions $f(\cdot;\pi)$ are absolutely continuous with respect to the Lebesgue measure, the log ratio of the posterior probabilities $\delta^\pi$ as well as its conditional expectation $\Delta^\pi$ are always finite. Here, we have

$$Y_{ij} = \frac{1}{\sigma^2}(\pi_{z_i w_j} - \pi_{z_i^* w_j^*})(X_{ij} - \pi^*_{z_i^* w_j^*}).$$

Thus, $Y_{ij}$ is Gaussian centered with variance $(\pi_{z_i w_j} - \pi_{z_i^* w_j^*})^2/\sigma^2$. A Cramér-Chernoff bound for Gaussian variables gives, for any $x > 0$ (see for instance Massart, 2007),

$$P_*^{z_n^* w_m^*}(|Y_{ij}| \ge x) \le 2\exp\left( -\frac{\sigma^2 x^2}{2(\pi_{z_i w_j} - \pi_{z_i^* w_j^*})^2} \right) \le 2\exp\left( -\frac{\sigma^2 x^2}{2(\pi_{\max} - \pi_{\min})^2} \right).$$

Combining this with the independence of the $Y_{ij}$'s and letting $D = \mathrm{diff}(z_n^*, w_m^*, z_n, w_m)$, we obtain that for any $\varepsilon > 0$,

$$P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big) \le 2\exp\left( -\frac{D\sigma^2\varepsilon^2}{2(\pi_{\max} - \pi_{\min})^2} \right).$$

This corresponds to Inequality (29) with $\psi^*_{\max}(x) = x^2\sigma^2/(2(\pi_{\max} - \pi_{\min})^2)$, which gives the desired result. □

The following corollary is a direct consequence of the previous lemma and the fact that Assumption 5 is satisfied in this case with $L_0 = \pi_{\max}/\sigma^2$.

Corollary 8. Consider the model defined by (1) where $f(\cdot;\pi)$ is a Gaussian distribution with mean value $\pi$ and fixed variance $\sigma^2$. Under ii) of Assumption 1 and Assumptions 2, 6 and 10, the conclusions of Theorem 1 and Corollaries 1 to 3 are valid.
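
For the Gaussian location model the tail bound can be compared with the exact tail, since $P(|Y| \ge x) = \mathrm{erfc}(x/(s\sqrt{2}))$ for a centered Gaussian variable with standard deviation $s$. The sketch below does this comparison; the values of $\sigma$, $\pi_{\min}$ and $\pi_{\max}$ in the demo are hypothetical.

```python
import math

def gaussian_two_sided_tail(x, sd):
    # Exact P(|Y| >= x) for Y ~ N(0, sd^2).
    return math.erfc(x / (sd * math.sqrt(2)))

def chernoff_bound(x, sd):
    # Cramér-Chernoff bound 2 * exp(-x^2 / (2 sd^2)).
    return 2 * math.exp(-x ** 2 / (2 * sd ** 2))

def y_sd(pi_a, pi_b, sigma):
    # Standard deviation of (pi_a - pi_b)(X - E X) / sigma^2 when Var(X) = sigma^2.
    return abs(pi_a - pi_b) / sigma

if __name__ == "__main__":
    sigma, pi_min, pi_max = 1.5, -1.0, 2.0  # hypothetical values
    sd_max = y_sd(pi_max, pi_min, sigma)
    for x in (0.5, 2.0, 5.0):
        print(x, gaussian_two_sided_tail(x, sd_max), chernoff_bound(x, sd_max))
```

Replacing the actual gap between two means by $\pi_{\max} - \pi_{\min}$ increases the standard deviation used in the bound, so the resulting inequality is uniform over the parameter set, at the price of some slack.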



4.7 Gaussian scale model 

In this section, we are interested in Gaussian observations with fixed mean and different variances. We thus assume that $X_{ij} \in \mathbb{R}$ and $f(\cdot;\pi)$ is a Gaussian distribution with fixed mean value $m$ and variance $\pi \in (0, +\infty)$. Namely, we have $f(x;\pi_{ij}) = c\,(\pi_{ij})^{-1/2}\exp\{-(x - m)^2/(2\pi_{ij})\}$, where $c$ is a normalizing constant. Note that point i) in Assumption 1 is satisfied. We also impose bounded values for $\pi \in \Pi$, namely Assumption 9, for concentration inequalities to be uniformly satisfied.

Lemma 8. Under Assumption 9, we obtain that Assumption 3 is satisfied with

$$\psi^*(x) = \frac{\mu_{\min}^2\, \pi_{\min}\, x}{8(\pi_{\max} - \pi_{\min})} - \frac{\mu_{\min}^2}{16}\log\left\{ 1 + \frac{2\pi_{\min} x}{\pi_{\max} - \pi_{\min}} \right\}.$$

Proof. Since the distributions $f(\cdot;\pi)$ are absolutely continuous with respect to the Lebesgue measure, the log ratio of the posterior probabilities $\delta^\pi$ as well as its conditional expectation $\Delta^\pi$ are always finite. Here, we have

$$Y_{ij} = \frac{\pi_{z_i^* w_j^*} - \pi_{z_i w_j}}{2\pi_{z_i w_j}} \times \left( \frac{(X_{ij} - m)^2}{\pi_{z_i^* w_j^*}} - 1 \right).$$

Thus, up to a scale factor, $Y_{ij}$ follows a centered $\chi^2(1)$ (chi-square with 1 degree of freedom) distribution. A Cramér-Chernoff bound for $\chi^2(1)$ random variables gives, for any $x > 0$ (see for instance Massart, 2007),

$$P_*^{z_n^* w_m^*}(|Y_{ij}| \ge x) = P\left( |X - 1| \ge \frac{2\pi_{z_i w_j}\, x}{|\pi_{z_i^* w_j^*} - \pi_{z_i w_j}|} \right) \le P\left( |X - 1| \ge \frac{2\pi_{\min}\, x}{\pi_{\max} - \pi_{\min}} \right) \le 2\exp\left\{ -\frac{\pi_{\min}\, x}{\pi_{\max} - \pi_{\min}} + \frac{1}{2}\log\left( 1 + \frac{2\pi_{\min}\, x}{\pi_{\max} - \pi_{\min}} \right) \right\},$$

where $X \sim \chi^2(1)$. Combining this bound with the conditional independence of the $Y_{ij}$'s, Assumption 9 and letting $D = \mathrm{diff}(z_n^*, w_m^*, z_n, w_m)$, we obtain that for any $\varepsilon > 0$,

$$P_*^{z_n^* w_m^*}\Big( \Big| \sum_{(i,j)} Y_{ij} \Big| \ge \varepsilon D \Big) \le 2\exp\left\{ -\frac{\pi_{\min}\,\varepsilon D}{\pi_{\max} - \pi_{\min}} + \frac{D}{2}\log\left( 1 + \frac{2\pi_{\min}\,\varepsilon}{\pi_{\max} - \pi_{\min}} \right) \right\}.$$

This corresponds to Inequality (29) and leads to the desired result. □

The following corollary is a direct consequence of the previous lemma and the fact that Assumption 5 is satisfied in this case with $L_0 = 1/(2\pi_{\min})$.

Corollary 9. Consider the model defined by (1) where $f(\cdot;\pi)$ is a Gaussian distribution with fixed mean value $m$ and variance $\pi \in \Pi$. Under ii) of Assumption 1 and Assumptions 2, 6 and 9, the conclusions of Theorem 1 and Corollaries 1 to 3 are valid.

4.8 Mixture of Dirac and continuous distribution 

In this section, we assume that $X_{ij}$ follows a mixture of a Dirac mass at zero and a continuous
distribution (on $\mathbb{R}$ for instance). This situation is particularly relevant for modeling
sparse matrices (Ambroise and Matias, 2011). In this context, the former parameter $\pi$
becomes now $(\pi,\gamma) \in (0,1)\times\Gamma$ and we let

\[
f(\cdot;\pi,\gamma) = \pi f(\cdot;\gamma) + (1-\pi)\,\delta_0(\cdot), \tag{30}
\]

where $\delta_0$ is the Dirac mass at 0. For identifiability reasons, we also constrain the parametric
family $\{f(\cdot;\gamma);\ \gamma\in\Gamma\}$ such that any distribution in this set admits a continuous cumulative
distribution function (c.d.f.) at zero. Moreover, we shall assume that the distributions
$\{f(\cdot;\gamma);\ \gamma\in\Gamma\}$ have exactly the same support so that for any $\gamma\in\Gamma$, the random variable
$f(X_{ij};\gamma)$ is $\mathbb{P}^*$-almost surely non zero.

Assumption 11. Each distribution in $\{f(\cdot;\gamma);\ \gamma\in\Gamma\}$ admits a continuous c.d.f. at zero.
Moreover, the distributions $\{f(\cdot;\gamma);\ \gamma\in\Gamma\}$ have exactly the same support.

For instance, $f(\cdot;\gamma)$ may be absolutely continuous with respect to the Lebesgue measure.
Another interesting case consists in considering the density (with respect to the counting
measure) of the Poisson distribution, with parameter $\gamma$, but truncated at zero. Namely, for
any $k \ge 1$, we let $f(k;\gamma) = \gamma^k/(k!)\,(e^{\gamma}-1)^{-1}$. This leads to zero-inflated Poisson models
and more generally, one could consider other zero-inflated counts models.
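As an illustration, the zero-inflated Poisson case of the mixture (30) can be written out explicitly. This is a minimal sketch under the parametrization stated above (sparsity parameter $\pi$, zero-truncated Poisson with parameter $\gamma$); the function names are illustrative and not taken from the paper.

```python
import math

def truncated_poisson_pmf(k: int, gamma: float) -> float:
    """f(k; gamma) = gamma**k / (k! * (e**gamma - 1)) for k >= 1, and 0 for k < 1."""
    if k < 1:
        return 0.0
    return gamma ** k / (math.factorial(k) * math.expm1(gamma))

def zero_inflated_pmf(k: int, pi: float, gamma: float) -> float:
    """Mixture (30): pi * f(k; gamma) + (1 - pi) * (Dirac mass at 0)."""
    dirac_at_zero = 1.0 if k == 0 else 0.0
    return pi * truncated_poisson_pmf(k, gamma) + (1.0 - pi) * dirac_at_zero

# The mass at 0 is exactly 1 - pi, since the truncated Poisson puts no mass there.
print(zero_inflated_pmf(0, pi=0.3, gamma=2.0))
print(sum(zero_inflated_pmf(k, pi=0.3, gamma=2.0) for k in range(80)))
```

Because the continuous or truncated component never charges 0, the event $\{X_{ij} = 0\}$ identifies the Dirac component, which is what makes the pair $(\pi,\gamma)$ separately identifiable.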

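In the following, we will assume that $\pi$ satisfies Assumption 7 and that the family
$\{f(\cdot;\gamma);\ \gamma\in\Gamma\}$ satisfies a concentration property on its likelihood ratio statistics as follows.

Assumption 12. Fix $(z_n^*, w_m^*) \in \mathcal{U}^0$ and $(z_n, w_m)$ in $\mathcal{U}$ with $(z_n^*, w_m^*) \neq (z_n, w_m)$. Let
$Y_{ij} = \log[f(X_{ij};\gamma_{z_i^* w_j^*})/f(X_{ij};\gamma_{z_i w_j})] + c$, where $c$ is a centering constant. There exists a
positive function $\psi^*_{\max} : (0,+\infty) \to (0,+\infty]$ such that for any $x > 0$, for any $(i,j)\in\mathcal{I}$,

\[
\mathbb{P}^*_{z_n^* w_m^*}(|Y_{ij}| > x \mid X_{ij} \neq 0) \le 2\exp\Big\{-\sup_{\lambda\in I}\big(\lambda x - \psi_{\max}(\lambda)\big)\Big\} := 2\exp(-\psi^*_{\max}(x)),
\]

where $\psi_{\max}(\lambda) = \max_{(i,j)\in\mathcal{I}} \log \mathbb{E}^*_{z_n^* w_m^*}(\exp(\lambda Y_{ij}) \mid X_{ij} \neq 0)$ exists for any $\lambda \in I \subset (0,+\infty)$.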
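The rate function $\psi^*_{\max}$ in Assumption 12 is the Legendre transform of $\psi_{\max}$ over $I$. The sketch below approximates this supremum on a grid. The choice $\psi_{\max}(\lambda) = v\lambda^2/2$ (a sub-Gaussian log-moment generating function with hypothetical variance proxy $v$) is an assumption made purely for illustration, because its transform $x^2/(2v)$ is known in closed form.

```python
def legendre_transform(psi, x, lambdas):
    """Approximate sup over the grid `lambdas` of lambda * x - psi(lambda)."""
    return max(lam * x - psi(lam) for lam in lambdas)

# Hypothetical sub-Gaussian case: psi(lambda) = v * lambda**2 / 2.
v = 2.0
psi = lambda lam: 0.5 * v * lam ** 2

# Grid standing in for the interval I = (0, 5].
grid = [i / 1000.0 for i in range(1, 5001)]

x = 1.5
approx = legendre_transform(psi, x, grid)
exact = x ** 2 / (2.0 * v)  # closed form for this particular psi
print(approx, exact)
```

For families where $\psi_{\max}$ has no closed form, the same grid search gives a numerical handle on $\psi^*_{\max}$, provided the grid covers the maximizer.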

Lemma 9. Under Assumptions 7, 11 and 12, we obtain that Assumption 3 is satisfied, up
to an extra factor 2, with $\psi^*(x) = \mu_{\min}^2\big(\psi^*_{\max}(x/2) \wedge x^2/\{8[\log(1-a)-\log a]^2\}\big)/8$. Namely,
using the same notation as in Assumption 3, we get

\[
\mathbb{P}^*_{z_n^* w_m^*}\big(|(\delta^{\pi} - \Delta^{\pi})(z_n^*, w_m^*, z_n, w_m)| > \varepsilon\{mr_1 + nr_2\}\big) \le 4\exp[-\psi^*(\varepsilon)\{mr_1 + nr_2\}].
\]

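Proof. According to Assumptions 7 and 11, the log ratio of the posterior probabilities $\delta^{\pi}$
as well as its conditional expectation $\Delta^{\pi}$ are always finite. Here, we have $Y_{ij} = Y_{ij}^{(1)} + Y_{ij}^{(2)}$
where

\[
Y_{ij}^{(1)} = 1\{X_{ij} = 0\}\log\frac{1-\pi_{z_i^* w_j^*}}{1-\pi_{z_i w_j}} + 1\{X_{ij} \neq 0\}\log\frac{\pi_{z_i^* w_j^*}}{\pi_{z_i w_j}} + c_1,
\]
\[
Y_{ij}^{(2)} = 1\{X_{ij} \neq 0\}\log\frac{f(X_{ij};\gamma_{z_i^* w_j^*})}{f(X_{ij};\gamma_{z_i w_j})} + c_2 = 1\{X_{ij} \neq 0\}\,\tilde{Y}_{ij},
\]

where $c_1, c_2$ are centering constants and $\tilde{Y}_{ij}$ stands for the variable introduced in Assumption 12. Conditional on $(Z_n, W_m) = (z_n^*, w_m^*)$, the families
$\{Y_{ij}^{(1)}\}_{(i,j)\in\mathcal{I}}$ and $\{Y_{ij}^{(2)}\}_{(i,j)\in\mathcal{I}}$ respectively contain independent and centered random variables.
Moreover, as the random variables $\{Y_{ij}^{(1)}\}_{(i,j)\in\mathcal{I}}$ are bounded by $2[\log(1-a) - \log a]$,
we can apply a Cramér-Chernoff bound on the deviation of each $Y_{ij}^{(1)}$. For any $x > 0$, we
write

\[
\begin{aligned}
\mathbb{P}^*_{z_n^* w_m^*}(|Y_{ij}| > x)
&\le \mathbb{P}^*_{z_n^* w_m^*}\big(|Y_{ij}^{(1)}| > x/2\big) + \mathbb{P}^*_{z_n^* w_m^*}\big(|Y_{ij}^{(2)}| > x/2\big) \\
&\le 2\exp\left\{-\frac{(x/2)^2}{2[\log(1-a)-\log a]^2}\right\} + \pi_{z_i^* w_j^*}\,\mathbb{P}^*_{z_n^* w_m^*}\big(|\tilde{Y}_{ij}| > x/2 \mid X_{ij} \neq 0\big) \\
&\le 2\exp\left\{-\frac{x^2}{8[\log(1-a)-\log a]^2}\right\} + 2\exp\Big\{-\sup_{\lambda\in I}\big(\lambda x/2 - \psi_{\max}(\lambda)\big)\Big\}.
\end{aligned}
\]

Let $D = \mathrm{diff}(z_n^*, w_m^*, z_n, w_m)$. Combining these Cramér-Chernoff bounds with the respective
conditional independence of the $\{Y_{ij}^{(1)}\}_{ij}$ and $\{Y_{ij}^{(2)}\}_{ij}$ yields

\[
\begin{aligned}
\mathbb{P}^*_{z_n^* w_m^*}\left(\Big|\sum_{(i,j)\in\mathcal{I}} Y_{ij}\Big| > \varepsilon D\right)
&\le 2\exp\left\{-\frac{D\varepsilon^2}{8[\log(1-a)-\log a]^2}\right\} + 2\exp\Big\{-D\sup_{\lambda\in I}\big(\lambda\varepsilon/2 - \psi_{\max}(\lambda)\big)\Big\} \\
&\le 4\exp\left\{-D\left(\psi^*_{\max}(\varepsilon/2) \wedge \frac{\varepsilon^2}{8[\log(1-a)-\log a]^2}\right)\right\}.
\end{aligned}
\]

This corresponds to Inequality (29) (up to an extra factor 2) and yields the result. □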
In order to ensure Assumption 5, we need the hypothesis to be satisfied on the family
$\{f(\cdot;\gamma);\ \gamma\in\Gamma\}$.

Assumption 13. There exists some positive constant $L_0$ such that for any $\gamma,\gamma' \in \Gamma_{\mathcal{Q}\mathcal{L}}$
and any $(q,l), (q',l') \in \mathcal{Q}\times\mathcal{L}$, we have

\[
\left|\int_{\mathcal{X}} \log\frac{f(x;\gamma_{ql})}{f(x;\gamma'_{ql})}\, f(x;\gamma_{q'l'})\,dx\right| \le L_0\,\|\gamma-\gamma'\|_{\infty}.
\]

Note that we provided in the previous sections many examples of families for which this
assumption is satisfied. Combined with Assumption 7, this ensures that Assumption 5 is
satisfied with constant $a^{-1} + L_0$. Then, the following corollary is a direct consequence of
the previous results and Assumption 11 ensuring that $\psi_{\max}$ is always finite.

Corollary 10. Consider the model defined by (1) where $f(\cdot;\pi,\gamma)$ is a mixture given by (30).
Under Assumptions [I] [?] EH and 13, the conclusions of Theorem 1 and Corollaries 1
to 3 are valid.



5 Asymptotically decreasing connections density 

In this section, we explore the limiting case where the numbers of groups $Q$ and $L$ remain
constant while the connection probabilities between groups converge to 0. This framework
is interesting as it models the case where group sizes increase linearly with the number
of row/column objects, while the mean number of connections (i.e. non-null observations
in the data matrix) increases only sub-linearly, mimicking for example budget constraints
in terms of global consumptions. More precisely, we will consider two different setups, the
first one being built on the binary case developed in Section 4.2 and the second one being
built on the weighted case from Section 4.8. As in the previous sections, we assume that
$m \le n$, view $m := m_n$ as a sequence depending on $n$ and state the results with respect to
$n \to \infty$. We shall furthermore assume that the probability of connection (binary case) or the
sparsity parameter (weighted case) $\pi_{ql,n}$ depends on $n$ and writes $\pi_{ql,n} = \zeta_n\pi_{ql}$ where $(\zeta_n)_{n\ge 1}$
converges to zero and $\pi_{ql}$ is a positive constant. The sequence $(\zeta_n)_{n\in\mathbb{N}}$ controls the overall
density of the block model and acts as a scaling factor while the parameters $(\pi_{ql})_{(q,l)\in\mathcal{Q}\times\mathcal{L}}$
reflect the unscaled connection probabilities from the different groups. This parametrization
is analogous to the one studied in (Bickel and Chen, 2009). We shall now assume that the
unscaled connection/sparsity probabilities are well-behaved, as in Assumption 7, and shall
introduce the new parameter sets denoted by $\Pi_n$ and $\Pi_{\mathcal{Q}\mathcal{L},n}$ to account for the dependence
on the data size (i.e. number of rows/columns).

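Assumption 14. The parameter sets $\Pi_n$ and $\Pi_{\mathcal{Q}\mathcal{L},n}$ depend on the number of observations
and we have

\[
\begin{aligned}
\Pi &\subset [a, 1-a] \ \text{for some } a \in (0,1/2), \\
\Pi_n &:= \zeta_n\Pi = \{\zeta_n\pi;\ \pi\in\Pi\}, \\
\Pi_{\mathcal{Q}\mathcal{L},n} &:= \zeta_n\Pi_{\mathcal{Q}\mathcal{L}} = \{\zeta_n\pi;\ \pi\in\Pi_{\mathcal{Q}\mathcal{L}}\},
\end{aligned}
\]

where $(\zeta_n)_{n\ge 1}$ is a sequence of values in $[0,1]$ converging to 0 and such that

\[
\frac{\log n}{n\,\zeta_n^2} \to 0 \quad\text{and}\quad \frac{\log n}{m_n\,\zeta_n^2} \to 0.
\]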
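To make the rate conditions of Assumption 14 concrete, the following sketch evaluates $\log n/(n\zeta_n^2)$ and $\log n/(m_n\zeta_n^2)$ for two hypothetical choices of $(m_n, \zeta_n)$. Both sequences are assumptions chosen for illustration only: the first satisfies the conditions, the second does not.

```python
import math

def density_rates(n: int, m_n: int, zeta_n: float):
    """Return (log n / (n * zeta_n**2), log n / (m_n * zeta_n**2))."""
    return (math.log(n) / (n * zeta_n ** 2),
            math.log(n) / (m_n * zeta_n ** 2))

sizes = (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8)

# Hypothetical admissible regime: m_n ~ n**(3/4), zeta_n = n**(-1/4).
# The second rate behaves like log(n) / n**(1/4) and goes to 0.
for n in sizes:
    print(n, density_rates(n, round(n ** 0.75), n ** -0.25))

# Hypothetical regime that is too sparse: m_n = n, zeta_n = n**(-1/2).
# Both rates behave like log(n) and diverge.
for n in sizes:
    print(n, density_rates(n, n, n ** -0.5))
```

The second regime corresponds to a bounded expected number of non-null entries per row, which is outside the scope of the assumption.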



5.1 Binary block models with a vanishing density 

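In this setup, the connectivity parameter $\pi_n = (\pi_{ql,n})_{(q,l)\in\mathcal{Q}\times\mathcal{L}}$ depends on $n$ and may be
arbitrarily close to 0. Accordingly, the constant $\kappa_{\min}(\pi_n)$ defined in (10) depends on $n$ and
is no longer bounded away from 0. We thus reconsider Assumptions 3, 5 and the definition
of $\kappa_{\min}(\pi_n)$ to exhibit the scaling in $n$ of several key quantities in this setup.

Lemma 10. Fix two parameters $\pi_n = \zeta_n\pi$ and $\pi'_n = \zeta_n\pi'$ in the set $\Pi_{\mathcal{Q}\mathcal{L},n}$, where $\pi,\pi' \in \Pi_{\mathcal{Q}\mathcal{L}}$.
Under Assumption 14, we have for all $n$ and all $(q,l),(q',l') \in \mathcal{Q}\times\mathcal{L}$

\[
\kappa_{\min,n} := \kappa_{\min}(\zeta_n\pi^*) \ge \zeta_n\,c_{\min}(\pi^*), \tag{31}
\]
\[
\left|\int_{\mathcal{X}} \log\frac{f(x;\pi_{ql,n})}{f(x;\pi'_{ql,n})}\, f(x;\pi_{q'l',n})\,dx\right| \le \frac{\zeta_n}{a}\,\|\pi-\pi'\|_{\infty}, \tag{32}
\]
\[
\psi^*_n(x) := \psi^*(x) = x^2\mu_{\min}^2/[16\{\log(1-a)-\log a\}^2], \tag{33}
\]

where

\[
c_{\min}(\pi^*) = \frac{1}{2}\left(\frac{a}{1-a}\right)^2 \times \min\left\{\frac{(\pi^*_{ql}-\pi^*_{q'l'})^2}{\pi^*_{ql}};\ (q,l),(q',l') \in \mathcal{Q}\times\mathcal{L},\ \pi^*_{ql} \neq \pi^*_{q'l'}\right\} > 0.
\]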

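Proof. For any $\pi,\pi' \in \Pi$ and any $\zeta \in (0,1)$, the Kullback-Leibler divergence $D(\zeta\pi\|\zeta\pi')$
writes

\[
D(\zeta\pi\|\zeta\pi') = \zeta\pi\log\frac{\pi}{\pi'} + (1-\zeta\pi)\log\frac{1-\zeta\pi}{1-\zeta\pi'}
= -\zeta\pi\log\left(1 + \frac{\pi'-\pi}{\pi}\right) - (1-\zeta\pi)\log\left(1 + \frac{\zeta(\pi-\pi')}{1-\zeta\pi}\right).
\]

Now, relying on the convexity inequality $\log(1+x) \le x$ valid for $x > -1$ and also on a
Taylor series expansion of $\log(1+x)$, there exists some $\theta$ with $|\theta| \le |\pi'-\pi|/\pi$ such that

\[
D(\zeta\pi\|\zeta\pi') \ge \zeta(\pi-\pi') + \zeta\,\frac{(\pi-\pi')^2}{2\pi}\,\frac{1}{(1+\theta)^2} - \zeta(\pi-\pi')
\ge \zeta\,\frac{(\pi-\pi')^2}{2\pi}\left(\frac{a}{1-a}\right)^2.
\]

Coming back to the definition (10) of $\kappa_{\min}(\pi^*)$ yields

\[
\kappa_{\min,n} := \kappa_{\min}(\pi^*_n) = \kappa_{\min}(\zeta_n\pi^*)
= \min\{D(\zeta_n\pi^*_{ql}\,\|\,\zeta_n\pi^*_{q'l'});\ (q,l),(q',l') \in \mathcal{Q}\times\mathcal{L},\ \pi^*_{ql} \neq \pi^*_{q'l'}\}
\ge \zeta_n\,c_{\min}(\pi^*), \quad\text{for all } n.
\]

Note that $\kappa_{\min,n}$ scales with $\zeta_n$ only when $\Pi$ is bounded away from 0 and 1. Otherwise a
simple bound based on the comparison between Kullback-Leibler divergence and the total
variation metric shows that $\kappa_{\min,n}$ scales with $\zeta_n^2$.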
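The linear scaling in $\zeta$ can be observed numerically. The sketch below (illustrative names, arbitrary values of $\pi$ and $\pi'$ inside $[a, 1-a]$) computes $D(\zeta\pi\|\zeta\pi')/\zeta$ for decreasing $\zeta$ and compares it with its limit $\pi\log(\pi/\pi') + \pi' - \pi$, obtained from a first-order expansion of the second logarithm.

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

pi, pi_prime = 0.3, 0.6  # arbitrary unscaled connection probabilities

# Limit of D(zeta * pi || zeta * pi') / zeta as zeta goes to 0.
limit = pi * math.log(pi / pi_prime) + pi_prime - pi

for zeta in (1.0, 0.1, 0.01, 0.001):
    ratio = kl_bernoulli(zeta * pi, zeta * pi_prime) / zeta
    print(f"zeta={zeta:7.3f}  D/zeta={ratio:.6f}  limit={limit:.6f}")
```

The ratio stays bounded away from 0 as $\zeta \to 0$, so the divergence itself is of exact order $\zeta$ when the unscaled parameters are fixed and distinct.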

A similar scaling can be found to replace Assumption 5. Indeed, for any $\pi,\pi',\pi'' \in \Pi$
and $\zeta > 0$, we have in the binary case

\[
\left|\zeta\pi''\log\frac{\pi}{\pi'} + (1-\zeta\pi'')\log\frac{1-\zeta\pi}{1-\zeta\pi'}\right| \le \frac{\zeta\,|\pi-\pi'|}{a}.
\]

Therefore, for any $(q,l),(q',l') \in \mathcal{Q}\times\mathcal{L}$,

\[
\left|\int_{\mathcal{X}} \log\frac{f(x;\pi_{ql,n})}{f(x;\pi'_{ql,n})}\, f(x;\pi_{q'l',n})\,dx\right| \le \frac{\zeta_n}{a}\,\|\pi-\pi'\|_{\infty}.
\]

Finally, we need the correct scaling for $\psi^*_n(x)$ that appears in Assumption 3. Following
the scheme of proof developed in Section 4.1, it turns out that the random variables $Y_{ij,n}$
defined by

\[
Y_{ij,n} = X_{ij,n}\log\frac{\pi_{z_i^* w_j^*,n}}{\pi_{z_i w_j,n}} + (1 - X_{ij,n})\log\frac{1-\pi_{z_i^* w_j^*,n}}{1-\pi_{z_i w_j,n}} + c_n
\]

(where $c_n$ is a centering constant) still satisfy that, conditional on $(Z_n, W_m) = (z_n^*, w_m^*)$,
these are independent and bounded by $2[\log(1-a) - \log a]$. The same Hoeffding's Inequality
then yields the concentration inequality of Assumption 3 with $\psi^*_n(x) = \psi^*(x) = x^2\mu_{\min}^2/[16\{\log(1-a)-\log a\}^2]$. □



Corollary 11. Under Assumption 1 on the unscaled parameter set $\Pi_{\mathcal{Q}\mathcal{L}}$ and Assumption 14,
Theorem 1 and Corollaries 1 to 3 remain valid with the following modifications

1. $c = \mu_{\min}^2\,c_{\min}/16$;

2. $L_0 = a^{-1}$;

3. $(c - 2L_0\|\pi-\pi^*\|_{\infty})$ is replaced by $\zeta_n(c - 2L_0\|\pi-\pi^*\|_{\infty})$.

Remark 4. Assumption is not in force in this theorem. Indeed, the scaling imposed
with $\zeta_n$ in Assumption 14 implies it: it forces $\log n/(n\zeta_n^2) \to 0$ and $\log n/(m_n\zeta_n^2) \to 0$ and
thus makes the speed at which $m_n/n$ can go to 0 depend on $\zeta_n$. Note that $m_n\zeta_n^2$ plays in
Assumption 14 the same role as $m_n$ in Assumption^

Proof. The proof is essentially the same as the proof of Theorem 1. We will only highlight
the differences and show how the scaling $\log n/(m_n\zeta_n^2) \to 0$ is derived. First, Equation (12)
from Proposition 1 now depends on $n$ and should be

\[
\mathbb{E}^*_{z_n^* w_m^*}\big(\delta^{\pi_n}(Z_n, W_m, z_n, w_m)\big) \ge 2\zeta_n\,(c' - L_0\|\pi-\pi^*\|_{\infty})(mr_1 + nr_2), \tag{34}
\]

where the original $c = \mu_{\min}^2\kappa_{\min}/16$ has been changed to $c' = \mu_{\min}^2 c_{\min}/16$. Next, the
set $A(z_n^*, w_m^*, z_n, w_m)$ must be changed so that we consider two-sided deviations between
$\delta^{\pi_n}(Z_n, W_m, z_n, w_m)$ and its conditional expectation of order $\zeta_n(c' - L_0\|\pi-\pi^*\|_{\infty})(mr_1 + nr_2)$
instead of the previous $(c - L_0\|\pi-\pi^*\|_{\infty})(mr_1 + nr_2)$. Equation (22) therefore turns
to

\[
\mathbb{P}^*\big(A(z_n^*, w_m^*, z_n, w_m) \cap \{(Z_n, W_m) = (z_n^*, w_m^*)\}\big)
\le 2\exp[-\psi^*(\zeta_n(c' - 2L_0\eta))(mr_1 + nr_2)]\,\mu(z_n^*, w_m^*). \tag{35}
\]

The set $\Omega_1$ is still defined as in Equation (23) and on this set, Inequalities (21) and (16) are
still satisfied. However, the upper bound on $\mathbb{P}^*(\bar{\Omega}_1)$ is modified as follows

\[
\mathbb{P}^*(\bar{\Omega}_1) \le \mathbb{P}^*(\bar{\Omega}_0) + 2|\mathfrak{S}|\Big[\{1 + \exp[-m\psi^*(\zeta_n(c' - 2L_0\eta))]\}^n\,\{1 + \exp[-n\psi^*(\zeta_n(c' - 2L_0\eta))]\}^m - 1\Big].
\]

Combining the latter with the control of the probability of $\Omega_0$ given in Proposition 1 and
the quadratic nature of $\psi^*$, we obtain

\[
\varepsilon_{n,m} := \mathbb{P}^*(\bar{\Omega}_1) \le 2QL\exp[-(n\wedge m)\mu_{\min}^2/2] + 2|\mathfrak{S}|\,d_{n,m}\exp(d_{n,m}),
\]

where $d_{n,m} = [n\exp\{-m\zeta_n^2\psi^*(c' - 2L_0\eta)\} + m\exp\{-n\zeta_n^2\psi^*(c' - 2L_0\eta)\}]$. The condition
required to make the $\varepsilon_{n,m}$ summable and conclude the proof is $\log n/(m\zeta_n^2) \to 0$. This
condition holds under Assumption 14. □
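The role of the condition $\log n/(m\zeta_n^2) \to 0$ can be seen directly on $d_{n,m}$. In the sketch below, the positive constant `psi_value` stands for $\psi^*(c' - 2L_0\eta)$ and is set to 1 as a placeholder assumption; the two density regimes are hypothetical as well.

```python
import math

def d_nm(n: int, m: int, zeta: float, psi_value: float) -> float:
    """d_{n,m} = n * exp(-m * zeta**2 * psi) + m * exp(-n * zeta**2 * psi)."""
    return (n * math.exp(-m * zeta ** 2 * psi_value)
            + m * math.exp(-n * zeta ** 2 * psi_value))

psi_value = 1.0  # placeholder for psi*(c' - 2 * L_0 * eta), assumed positive

# zeta_n = n**(-1/4), m = n: m * zeta_n**2 = sqrt(n) dominates log(n), so d_{n,m} vanishes.
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, d_nm(n, n, n ** -0.25, psi_value))

# zeta_n = n**(-1/2), m = n: m * zeta_n**2 stays bounded, so d_{n,m} grows like n.
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, d_nm(n, n, n ** -0.5, psi_value))
```

Each term of $d_{n,m}$ has the form $\exp\{\log n - m\zeta_n^2\psi\}$, which is why the exponent must eventually be dominated by $m\zeta_n^2$.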



5.2 Weighted models with a vanishing density 

We now consider the setup introduced in Section 4.8 except that we shall now assume that
the sparsity parameters $\pi_{ql,n} := \zeta_n\pi_{ql}$ may be arbitrarily close to zero (see Assumption 14).
Note that the parameters $(\gamma_{ql})_{(q,l)\in\mathcal{Q}\times\mathcal{L}}$ remain fixed. Moreover, Assumptions 11 and 12
are assumed to hold.

In the next lemma, we provide the scaling of $\kappa_{\min}(\pi_n,\gamma)$, or more accurately a lower
bound thereof, and show that Assumption 13 is sufficient to guarantee the adequate scaling
of the Lipschitz condition.

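Lemma 11. Fix two parameters $\pi_n = \zeta_n\pi$ and $\pi'_n = \zeta_n\pi'$ in the set $\Pi_{\mathcal{Q}\mathcal{L},n}$, where $\pi,\pi' \in \Pi_{\mathcal{Q}\mathcal{L}}$.
Under Assumptions 11 to 14, we have for all $n$ and all $(q,l),(q',l') \in \mathcal{Q}\times\mathcal{L}$

\[
\kappa_{\min,n} := \kappa_{\min}(\zeta_n\pi^*,\gamma^*) \ge \zeta_n\big(c_{\min}(\pi^*) + a\,\kappa_{\min}(\gamma^*)\big), \tag{36}
\]
\[
\left|\int_{\mathcal{X}} \log\frac{f(x;\pi_{ql,n},\gamma_{ql})}{f(x;\pi'_{ql,n},\gamma'_{ql})}\, f(x;\pi_{q'l',n},\gamma_{q'l'})\,dx\right| \le \zeta_n\left[\frac{\|\pi-\pi'\|_{\infty}}{a} + L_0\|\gamma-\gamma'\|_{\infty}\right], \tag{37}
\]
\[
\psi^*_n(x) := \psi^*(x) = \frac{\mu_{\min}^2}{8}\left(\psi^*_{\max}\Big(\frac{x}{2}\Big) \wedge \frac{x^2}{8\{\log(1-a)-\log a\}^2}\right), \tag{38}
\]

where

\[
\kappa_{\min} := \kappa_{\min}(\gamma^*) = \min\{D(\gamma^*_{ql}\,\|\,\gamma^*_{q'l'});\ (q,l),(q',l') \in \mathcal{Q}\times\mathcal{L},\ \gamma^*_{ql} \neq \gamma^*_{q'l'}\} > 0,
\]
\[
c_{\min} := c_{\min}(\pi^*) = \frac{1}{2}\left(\frac{a}{1-a}\right)^2 \times \min\left\{\frac{(\pi^*_{ql}-\pi^*_{q'l'})^2}{\pi^*_{ql}};\ (q,l),(q',l') \in \mathcal{Q}\times\mathcal{L},\ \pi^*_{ql} \neq \pi^*_{q'l'}\right\} > 0.
\]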



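Proof. For all $\pi,\pi',\pi'' \in \Pi$, $\gamma,\gamma',\gamma'' \in \Gamma$ and $\zeta > 0$, we have:

\[
\int_{\mathcal{X}} \log\frac{f(x;\zeta\pi,\gamma)}{f(x;\zeta\pi',\gamma')}\, f(x;\zeta\pi'',\gamma'')\,dx
= \zeta\pi''\log\frac{\pi}{\pi'} + (1-\zeta\pi'')\log\frac{1-\zeta\pi}{1-\zeta\pi'}
+ \zeta\pi''\int_{\mathcal{X}} \log\frac{f(x;\gamma)}{f(x;\gamma')}\, f(x;\gamma'')\,dx. \tag{39}
\]

When $(\pi'',\gamma'') = (\pi,\gamma)$, Equation (39) turns to

\[
D\big((\zeta\pi,\gamma)\,\|\,(\zeta\pi',\gamma')\big) = D(\zeta\pi\|\zeta\pi') + \zeta\pi\,D(\gamma\|\gamma')
\ge \zeta\,\frac{(\pi-\pi')^2}{2\pi}\left(\frac{a}{1-a}\right)^2 + \zeta\,a\,D(\gamma\|\gamma'),
\]

from which we can deduce Inequality (36).

For general $(\pi'',\gamma'')$, Equation (39) combined with Inequality (32) and Assumption 13
gives

\[
\left|\int_{\mathcal{X}} \log\frac{f(x;\zeta\pi,\gamma)}{f(x;\zeta\pi',\gamma')}\, f(x;\zeta\pi'',\gamma'')\,dx\right|
\le \zeta\,\frac{|\pi-\pi'|}{a} + \zeta L_0\,|\gamma-\gamma'|,
\]

from which we can deduce Inequality (37).

Finally, in this setup Lemma 9 is still valid and gives Equation (38). □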

Now, we introduce an assumption about the quadratic nature of the function $\psi^*_{\max}$
introduced in Assumption 12.

Assumption 15. With the notation of Assumption 12, for all $x > 0$, there exists some
positive $m_x$ such that for all $\zeta \in (0, m_x)$

Remark 5. Assumption 15 ensures that $\psi^*(\zeta x)$ defined in Equation (38) does not decrease
faster than $\zeta^2$ with $\zeta$ and that the condition $\log(n)/(m_n\zeta^2) \to 0$ is the correct asymptotics
in Corollary 12. Note also that Assumption 15 for all distributions considered in
Section

Corollary 12. Under Assumption 1 on the unscaled parameter set $\Pi_{\mathcal{Q}\mathcal{L}}$ and Assumptions 11
to 15, Theorem 1 and Corollaries 1 to 3 remain valid with the following modifications

1. $L_0$ is replaced by $a^{-1} + L_0$, with $L_0$ on the right-hand side given by Assumption 13;

2. $c = \mu_{\min}^2(c_{\min} + a\,\kappa_{\min})/16$;

3. $\pi$ is replaced by $(\zeta_n\pi,\gamma)$;

4. $(c - 2L_0\|\pi-\pi^*\|_{\infty})$ is replaced by $\zeta_n(c - 2L_0\|(\pi,\gamma) - (\pi^*,\gamma^*)\|_{\infty})$.

Proof. This result is proved following the proof of Theorem 1 exactly in the same way as
we did for Corollary 11, with some changes in key quantities as listed in the corollary. □

A Technical proofs

Proof of Lemma 1. Let us recall that this proof is a generalization of the proof of (Proposition
B.5 in Celisse et al., 2011).

Since $(z_n^*, w_m^*) \in \mathcal{U}^0$, for any $q \in \mathcal{Q}$ and any $l \in \mathcal{L}$, the number of entries in $z_n^*$ (resp.
in $w_m^*$) which take value $q$ (resp. $l$) is at least $\lceil n\mu_{\min}/2\rceil$ (resp. $\lceil m\mu_{\min}/2\rceil$). Up to a
reordering of the vectors $z_n^*$ and $w_m^*$, we may assume that the first $Q\lceil n\mu_{\min}/2\rceil$ entries of
$z_n^*$ and the first $L\lceil m\mu_{\min}/2\rceil$ entries of $w_m^*$ are fixed, with

\[
\begin{aligned}
z_n^* &= (1,2,\dots,Q,\ 1,2,\dots,Q,\ \dots,\ 1,2,\dots,Q,\ z^*_{Q\lceil n\mu_{\min}/2\rceil+1},\dots,z^*_n), \\
w_m^* &= (1,2,\dots,L,\ 1,2,\dots,L,\ \dots,\ 1,2,\dots,L,\ w^*_{L\lceil m\mu_{\min}/2\rceil+1},\dots,w^*_m).
\end{aligned} \tag{40}
\]

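Such ordering of the entries of $(z_n^*, w_m^*)$ induces a specific ordering of the entries of $(z_n, w_m)$.
For each $k \in \{1,\dots,\lceil n\mu_{\min}/2\rceil\}$ (resp. each $j \in \{1,\dots,\lceil m\mu_{\min}/2\rceil\}$), we denote by $s_k$ (resp.
$t_j$) the application from $\mathcal{Q}$ to $\mathcal{Q}$ (resp. from $\mathcal{L}$ to $\mathcal{L}$) defined by

\[
\forall q \in \mathcal{Q},\ s_k\big(z^*_{(k-1)Q+q}\big) = z_{(k-1)Q+q} \quad\text{and}\quad \forall l \in \mathcal{L},\ t_j\big(w^*_{(j-1)L+l}\big) = w_{(j-1)L+l}.
\]

In other words, we write $z_n$ and $w_m$ in the form

\[
\begin{aligned}
z_n &= \big(s_1(1), s_1(2),\dots,s_1(Q),\ s_2(1),\dots,s_2(Q),\ \dots,\ s_{\lceil n\mu_{\min}/2\rceil}(1),\dots,s_{\lceil n\mu_{\min}/2\rceil}(Q),\ z_{Q\lceil n\mu_{\min}/2\rceil+1},\dots,z_n\big), \\
w_m &= \big(t_1(1), t_1(2),\dots,t_1(L),\ t_2(1),\dots,t_2(L),\ \dots,\ t_{\lceil m\mu_{\min}/2\rceil}(1),\dots,t_{\lceil m\mu_{\min}/2\rceil}(L),\ w_{L\lceil m\mu_{\min}/2\rceil+1},\dots,w_m\big).
\end{aligned} \tag{41}
\]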

There are several possible orderings of $z_n^*$ (resp. $w_m^*$) in the form (40) and each one
induces a different ordering of $z_n$ (resp. $w_m$) in the form (41). For example, for any
$1 \le k,k' \le \lceil n\mu_{\min}/2\rceil$ and any $q \in \mathcal{Q}$, we can exchange $z^*_{(k-1)Q+q}$ and $z^*_{(k'-1)Q+q}$ which are
both equal to $q$ and this induces a permutation between $s_k(q)$ and $s_{k'}(q)$ in $z_n$. (Similarly
for any $1 \le j,j' \le \lceil m\mu_{\min}/2\rceil$ and any $l \in \mathcal{L}$, we can exchange $t_j(l)$ and $t_{j'}(l)$ in $w_m$.)
Also, for any $i > Q\lceil n\mu_{\min}/2\rceil$, $z_i^*$ is equal to some $q \in \mathcal{Q}$ and can be exchanged with
$z^*_{(k-1)Q+q}$ for any $1 \le k \le \lceil n\mu_{\min}/2\rceil$. This induces a permutation between $s_k(z_i^*)$ and $z_i$
in $z_n$. (Similarly, we can exchange $t_j(w_i^*)$ and $w_i$ in $w_m$ for any $i > L\lceil m\mu_{\min}/2\rceil$ and any
$1 \le j \le \lceil m\mu_{\min}/2\rceil$.) Note also that the orderings of $z_n^*$ and $w_m^*$ are independent. As
already said, each $s_k$ (resp. $t_j$) is a function from $\mathcal{Q}$ to $\mathcal{Q}$ (resp. from $\mathcal{L}$ to $\mathcal{L}$). We can
therefore choose orderings of $z_n^*$ and $w_m^*$ which minimize the number (ranging from 0 to
$\lceil n\mu_{\min}/2\rceil$) of injective functions $s_k$ as well as the number (ranging from 0 to $\lceil m\mu_{\min}/2\rceil$) of
injective functions $t_j$.

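For $1 \le k \le \lceil n\mu_{\min}/2\rceil$ and $1 \le j \le \lceil m\mu_{\min}/2\rceil$, let

\[
B_{kj} = \big|\{(q,l) \in \mathcal{Q}\times\mathcal{L};\ \pi^*_{ql} \neq \pi^*_{s_k(q)t_j(l)}\}\big|.
\]

We have of course $\mathrm{diff}(z_n, w_m, z_n^*, w_m^*) \ge \sum_{k=1}^{\lceil n\mu_{\min}/2\rceil}\sum_{j=1}^{\lceil m\mu_{\min}/2\rceil} B_{kj}$.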
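The block decomposition behind this inequality can be illustrated on a toy example. The sketch below assumes that diff counts the cells $(i,j)$ with $\pi^*_{z_i w_j} \neq \pi^*_{z_i^* w_j^*}$, a reading that is not spelled out in this section, and uses 0-based group labels. All names and numerical values are hypothetical.

```python
from itertools import product

def diff(pi, z, w, z_star, w_star):
    """Number of cells (i, j) whose connectivity parameter differs between the
    two configurations (assumed reading of diff, not defined in this section)."""
    return sum(pi[z[i]][w[j]] != pi[z_star[i]][w_star[j]]
               for i, j in product(range(len(z)), range(len(w))))

def block_counts(pi, z, w, Q, L, K, J):
    """B_kj when z_star, w_star are in the form (40): the first K*Q entries of
    z_star cycle through 0..Q-1 and the first J*L entries of w_star through 0..L-1."""
    B = [[0] * J for _ in range(K)]
    for k, j in product(range(K), range(J)):
        s_k = z[k * Q:(k + 1) * Q]  # s_k(q) is the label of z inside row block k
        t_j = w[j * L:(j + 1) * L]  # t_j(l) is the label of w inside column block j
        B[k][j] = sum(pi[q][l] != pi[s_k[q]][t_j[l]]
                      for q, l in product(range(Q), range(L)))
    return B

pi = [[0.1, 0.5], [0.5, 0.9]]            # Q = L = 2
z_star, w_star = [0, 1, 0, 1, 0], [0, 1, 0, 1]
z, w = [0, 1, 1, 1, 0], [0, 1, 1, 0]     # s_2 is non-injective, t_2 is a swap

B = block_counts(pi, z, w, Q=2, L=2, K=2, J=2)
total_B = sum(map(sum, B))
total_diff = diff(pi, z, w, z_star, w_star)
print(B, total_B, total_diff)
```

The sum of the $B_{kj}$ counts exactly the differing cells inside the first $KQ$ rows and $JL$ columns, so the remaining entries of $z_n$ and $w_m$ can only add to diff.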

The simplest case is obtained when for any $(k,j)$, we have $B_{kj} \ge 1$ and then

\[
\mathrm{diff}(z_n, w_m, z_n^*, w_m^*) \ge \frac{n\mu_{\min}}{2} \times \frac{m\mu_{\min}}{2} \ge \frac{\mu_{\min}^2}{8}(mr_1 + nr_2),
\]

since both $r_1 \le n$ and $r_2 \le m$. In this case, the proof is finished.

Otherwise, there is at least one $(k,j)$ such that $B_{kj} = 0$. In this case, we start by proving
that at least one application among the $s_k$ and at least one application among the $t_j$ are
permutations. Indeed, consider some $(k,j)$ with $B_{kj} = 0$. Assume that $s_k(q) = s_k(q')$ for
some $q \neq q'$. Then for all $l$, we have $\pi^*_{ql} = \pi^*_{s_k(q)t_j(l)} = \pi^*_{s_k(q')t_j(l)} = \pi^*_{q'l}$, which contradicts
Assumption 1. The same holds if $t_j(l) = t_j(l')$ for some $l \neq l'$. Therefore if $B_{kj} = 0$, both
$s_k$ and $t_j$ are injections and therefore permutations.

Now, we prove that all applications $s_k$ which are permutations are in fact equal. Indeed,
consider $k' \neq k$ such that $s_k$ and $s_{k'}$ are injections. Assume there exists some $q$ such
that $s_k(q) \neq s_{k'}(q)$. Then exchanging $s_k(q)$ and $s_{k'}(q)$ in $z_n$ decreases the number of
injective applications $s_j$ by 2, in contradiction with the minimality of the chosen ordering
of coordinates in $z_n^*$. Therefore, $s_k = s_{k'}$. Thus all injective $s_k$ are equal to the same
permutation $s \in \mathfrak{S}_Q$. Similarly, all injective $t_j$ are equal to the same permutation $t \in \mathfrak{S}_L$.
Since one of these pairs of permutations $(s_k, t_j)$ is associated to the event $B_{kj} = 0$, this
implies that $(\pi^*)^{s,t} = \pi^*$. Note also that according to Assumption 2 we necessarily have
$(s,t) \in \mathfrak{S}$.

We now argue that as soon as there is at least one injective application $s_k$ (which is
thus equal to $s$), we must have $z_i = s(z_i^*)$ for all $i \ge Q\lceil n\mu_{\min}/2\rceil + 1$. Otherwise, we
could decrease by one the total number of injective $s_k$ by permuting $z_i$ and $s(z_i^*)$, which
contradicts the minimality of the number of injections. In the same way, if there is at least
one injective application $t_j$ (thus equal to $t$), we have $w_i = t(w_i^*)$ for any $i \ge L\lceil m\mu_{\min}/2\rceil + 1$.

Let $d_1$ (resp. $d_2$) be the number (possibly equal to 0) of non-injective $s_k$ (resp. $t_j$). It
comes from the two previous points that we can in fact write

\[
\begin{aligned}
z_n &= \big(s_1(1),\dots,s_1(Q),\ \dots,\ s_{d_1}(1),\dots,s_{d_1}(Q),\ s(z^*_{d_1Q+1}),\dots,s(z^*_n)\big), \\
w_m &= \big(t_1(1),\dots,t_1(L),\ \dots,\ t_{d_2}(1),\dots,t_{d_2}(L),\ t(w^*_{d_2L+1}),\dots,t(w^*_m)\big),
\end{aligned}
\]

where $(s,t) \in \mathfrak{S}$. Thus, we obtain that

\[
\begin{aligned}
r_1 &= d(z_n, z_n^*) \le \|z_n - s(z_n^*)\| \le d_1 Q, \\
r_2 &= d(w_m, w_m^*) \le \|w_m - t(w_m^*)\| \le d_2 L.
\end{aligned}
\]

Finally, for each $(k,j)$ such that either $s_k$ or $t_j$ is non-injective, we have $B_{kj} \ge 1$. Therefore

\[
\begin{aligned}
\mathrm{diff}(z_n, w_m, z_n^*, w_m^*) &\ge \sum_{k=1}^{\lceil n\mu_{\min}/2\rceil}\sum_{j=1}^{\lceil m\mu_{\min}/2\rceil} B_{kj} \\
&\ge d_1\lceil m\mu_{\min}/2\rceil + d_2\lceil n\mu_{\min}/2\rceil - d_1 d_2 \\
&\ge \frac{d_1\lceil m\mu_{\min}/2\rceil + d_2\lceil n\mu_{\min}/2\rceil}{2} \\
&\ge \frac{\mu_{\min}^2}{8}(mr_1 + nr_2),
\end{aligned}
\]

where the last inequality comes from $\mu_{\min} \le 1/Q$. This concludes the proof of the lemma.

□